
Titan X Performance: PCI-E 3.0 x8 vs x16

Written on September 8, 2016 by Matt Bach
Table of Contents:
  1. Introduction
  2. Test Setup
  3. Unigine Heaven Pro 4.0
  4. Ashes of the Singularity
  5. GRID Autosport
  6. Davinci Resolve
  7. Octane Render
  8. Conclusion

Introduction

A number of years ago, we published the article "Impact of PCI-E Speed on Gaming Performance" investigating whether you would see a difference between a video card in a PCI-E 3.0 x8 slot versus one in a x16 slot. At the time, we found that even the fastest video card available had nearly identical performance whether it was running at x8 or x16.

In the three years since that article was published, however, video cards have become much faster, higher resolution displays (including 4K surround) have become more common, and GPUs are increasingly being used in professional applications. Because of these factors, we decided to revisit this topic with the latest hardware to see whether there is now a measurable difference between PCI-E 3.0 x8 and x16.
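As a bit of context before diving into the results, the raw bandwidth at stake is easy to work out: PCI-E 3.0 runs at 8 GT/s per lane with 128b/130b encoding, which works out to roughly 15.8 GB/s in each direction at x16 and half that at x8. A quick back-of-envelope sketch of the arithmetic:

```python
# Theoretical PCI-E 3.0 bandwidth per direction.
# 8 GT/s per lane with 128b/130b encoding (~98.5% efficiency).
GT_PER_LANE = 8e9        # transfers per second per lane (1 bit per transfer)
ENCODING = 128 / 130     # 128b/130b line-code efficiency

def pcie3_bandwidth_gb_s(lanes: int) -> float:
    """Usable bandwidth in GB/s for a PCI-E 3.0 link of the given width."""
    return GT_PER_LANE * ENCODING * lanes / 8 / 1e9  # bits -> bytes

print(f"x16: {pcie3_bandwidth_gb_s(16):.1f} GB/s")  # ~15.8 GB/s
print(f"x8:  {pcie3_bandwidth_gb_s(8):.1f} GB/s")   # ~7.9 GB/s
```

The question, of course, is whether anything a video card actually does comes close to saturating either figure.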

Test Setup

For our test system, we used the following hardware:

To see whether there is a difference between PCI-E 3.0 x8 and x16, we tested both a single GPU and dual GPUs (in SLI where appropriate) with three gaming benchmarks and two professional applications, each of which is highly GPU dependent: Unigine Heaven Pro 4.0, Ashes of the Singularity, GRID Autosport, Davinci Resolve, and Octane Render. To thoroughly cover the different PCI-E configurations you may run into, we tested single and dual Titan X configurations at full x16 speeds as well as limited to x8, which we did by covering half of each card's PCI-E contacts with a piece of insulating material (paper). The PCI-E configurations we tested are listed below, followed by a quick sketch of one way to verify the link width a card has actually negotiated:

  • Single GPU in PCI-E 3.0 x16
  • Single GPU in PCI-E 3.0 x8
  • Dual GPU in PCI-E 3.0 x16/x16
  • Dual GPU in PCI-E 3.0 x16/x8
  • Dual GPU in PCI-E 3.0 x8/x8
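Since we were forcing x8 by masking contacts, it was worth confirming that each card actually trained at the intended width. A minimal sketch using nvidia-smi's query interface (note that GPUs downshift the link to save power at idle, so check while the card is under load):

```python
import subprocess

# Ask the NVIDIA driver for each GPU's current PCIe generation and link width.
output = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current",
     "--format=csv,noheader"],
    text=True,
)
for line in output.strip().splitlines():
    index, name, gen, width = [field.strip() for field in line.split(",")]
    print(f"GPU {index} ({name}): PCI-E gen {gen} at x{width}")
```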

Unigine Heaven Pro 4.0

Starting out our testing is Unigine Heaven Pro 4.0, a fairly standard gaming benchmark. It is getting a little dated at this point, but it is still one of the best and most consistent DirectX 11 benchmarks we know of. To test a range of different displays, we included results for 1080p, 4K, and 4K surround with a variety of quality settings.

[Chart: Unigine Heaven 4.0 x8 vs x16 single GPU]

With a single Titan X, there was in most cases little benefit (less than 2%) to running the card at x16 instead of x8. In fact, with a 1080p display, PCI-E 3.0 x8 was actually about 9% faster than x16! This result is odd and not what we expected, but we ran the benchmark multiple times and got the same result every time.

[Chart: Unigine Heaven 4.0 x8 vs x16 SLI GPU]

Moving on to the dual GPU results in SLI, we saw effectively no difference at 1080p and 4K, but we did see some significant differences when using 4K surround. At this extremely high resolution (11520x2160), we saw no difference at ultra quality settings, but as we turned down the settings (which resulted in a higher FPS), running the cards at x16 became more and more important. When we were getting around 45 FPS, x16/x8 was still only about 1% slower than x16/x16, but x8/x8 was almost 4% slower. Lowering the settings to the point that we were getting roughly 100 FPS, we saw a 15% drop in performance with x16/x8 and a massive 30% drop in performance running at x8/x8.
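One plausible explanation - back-of-envelope math rather than anything we measured directly - is that with alternate frame rendering, every frame rendered on the secondary card has to be transferred back for display. At 11520x2160, each frame is large enough that this traffic alone claims a meaningful share of an x8 link's ~7.9 GB/s, on top of normal texture and command traffic:

```python
# Rough estimate of display-transfer traffic for SLI at 4K surround.
# Assumes 32-bit color and alternate frame rendering (half the frames
# come from the secondary GPU); treat this as an illustrative upper bound,
# since some of this traffic can also ride the SLI bridge.
WIDTH, HEIGHT = 11520, 2160   # 4K surround resolution
BYTES_PER_PIXEL = 4           # 32-bit color

frame_mb = WIDTH * HEIGHT * BYTES_PER_PIXEL / 1e6   # ~99.5 MB per frame
for fps in (45, 100):
    gb_s = frame_mb * (fps / 2) / 1e3   # half the frames cross the bus
    print(f"{fps:>3} FPS: ~{gb_s:.1f} GB/s of frame data")
# 45 FPS -> ~2.2 GB/s; 100 FPS -> ~5.0 GB/s
```

This lines up with what we saw: the higher the frame rate, the further the narrower links fell behind.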

Ashes of the Singularity

Ashes of the Singularity was one of the first DX12 games to include a built-in benchmarking utility. It is especially interesting because the developers specifically recommend taking dual GPU configurations out of SLI and running them as two separate cards due to how DX12 handles multi-GPU configurations.

[Chart: Ashes of the Singularity x8 vs x16 single GPU]

With a single GPU, we saw fairly consistent results across the different resolutions. In all cases, running the card at x8 was about 2-5% slower than running at x16.

[Chart: Ashes of the Singularity x8 vs x16 dual GPU]

With two Titan X video cards, we had a little bit of trouble getting 4K surround to work properly - in fact, we were unable to take the video cards out of SLI at all with surround enabled. Even physically removing the SLI bridge did not work, as it simply made the cards give driver errors in Device Manager.

At 1080p, all the results were within 2% (and x8/x8 was faster than x16/x8), so we are inclined to consider them all within our margin of error. At 4K, however, we saw about a 2.5-3.5% drop in performance with the cards running at x16/x8 and x8/x8.

GRID Autosport

[Chart: GRID Autosport x8 vs x16 single GPU]

With a single GPU, the results in this game were surprisingly consistent. Across the different resolutions we tested, GRID showed about a 5% drop in performance when running at x8 versus x16.

[Chart: GRID Autosport x8 vs x16 SLI GPU]

Unlike the single GPU results, when using two Titan X cards in SLI the results were not nearly as consistent. At 1080p, running at x16/x8 was actually the fastest, beating x16/x16 by about 10%. Similarly, x8/x8 was also faster than x16/x16, but only by about 8%.

Increasing the resolution to 4K gave us results more in line with what you would expect. At this resolution, x16/x8 was about 2% slower than x16/x16 and x8/x8 was about 8.5% slower. Finally, with 4K surround we saw about a 5% drop in performance with x16/x8 and a 6.5% drop in performance with x8/x8.

Davinci Resolve

Davinci Resolve is a non-linear video editing and color correction suite that makes heavy use of GPU acceleration for things like live video playback and exporting video. While this makes it a great professional application for testing PCI-E speeds, the one downside is that the benchmark file we are using to test playback speed (Standard Candle) is limited to 24 FPS, and the built-in FPS counter only reports whole numbers. This means that while we may see some performance differences, what appears to be a 5% difference in performance may actually be smaller due to the whole-number limitation.
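To put that limitation in numbers: at 24 FPS a single whole-number step is 1/24, or about 4.2% of the total, so any difference smaller than that can be rounded away entirely. A small illustration using made-up playback rates:

```python
# How whole-number FPS reporting hides small differences.
# The "true" playback rates below are hypothetical, for illustration only.
true_fps = {"x16/x16": 24.0, "x16/x8": 23.6, "x8/x8": 23.4}

for config, fps in true_fps.items():
    reported = round(fps)  # Resolve's counter only shows whole numbers
    print(f"{config}: true {fps:.1f} FPS -> reported {reported} FPS")

# x16/x8 rounds up to the same 24 FPS as x16/x16, while x8/x8 rounds
# down to 23 FPS - a real ~2.5% gap shows up as anywhere from 0 to 1 FPS.
```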

[Chart: Davinci Resolve x8 vs x16 single GPU]

The Standard Candle test file actually has 8 different Blur and TNR node tests, but since the simpler configurations gave a flat 24FPS, we decided to keep them out of our charts to help cut down on noise. Interestingly, while we saw a 1-2 FPS drop in performance with 30 Blur nodes and 4 TNR nodes, we did not see a drop in performance with 66 Blur nodes or 6 TNR nodes. This is likely due to the fact that Davinci Resolve only shows FPS in whole numbers, so we simply don't have a fine enough scale to see the difference.

[Chart: Davinci Resolve x8 vs x16 dual GPU]

With two Titan X video cards, we saw a small 1 FPS drop in performance at x8/x8 with 66 Blur nodes and 6 TNR nodes. However, with 4 TNR nodes we saw a 2 FPS drop with x16/x8 (roughly 8%) and a 3 FPS drop with x8/x8 (about 12%). While the whole-number limitation keeps us from being super precise, it is still clear that there is indeed a benefit to using x16/x16 over x16/x8 or x8/x8 in Davinci Resolve.

Octane Render

Octane Render was one of the first GPU-based rendering engines to be developed; it uses only the GPU (not the CPU) for rendering and scales extremely well across multiple GPUs. We looked at x8 versus x16 performance in Octane just a few months ago in our Octane Render GPU Comparison article, and while we saw no difference at that time, it is possible that the new Titan X is fast enough to show one. One important thing to note is that full support for the new Pascal video cards (including the Titan X) is not yet present in Octane. They work just fine and are still faster than the previous generation of cards, but once NVIDIA releases CUDA 8, a number of optimizations are expected to further increase performance.

[Chart: Octane Render x8 vs x16 single GPU]

With a single GPU, we interestingly saw about 2% faster performance with x8 over x16. This is odd, but is actually almost identical to what we saw in our Octane Render GPU Comparison article linked above.

[Chart: Octane Render x8 vs x16 dual GPU]

Moving on to two Titan X cards, the results are all pretty much the same. There were small changes in performance, but nothing larger than 1%, so they are well within our margin of error.

Conclusion

Overall, the results of our testing are pretty mixed. With a single Titan X, we saw a range of results between using a PCI-E 3.0 slot at x8 and at x16. Some applications (Unigine Heaven Pro and Octane Render) showed no difference, while others (Ashes of the Singularity, GRID Autosport, and Davinci Resolve) showed up to a ~5% difference in performance.

With dual GPUs, the results actually got a bit more confusing. Although Unigine Heaven Pro didn't see much of a difference with a single card, with two cards in SLI driving three 4K displays in surround we saw roughly a 15% drop in performance running at x16/x8 and a massive 30% drop running at x8/x8. On the other hand, Ashes of the Singularity only showed minimal differences, and GRID Autosport was actually faster at 1080p when running at x8/x8 - although it was about 8% slower at 4K and 4K surround. On the professional side, Octane Render still didn't show a difference with two cards, but Davinci Resolve saw up to a ~10% drop in performance with both x16/x8 and x8/x8.

While our results are somewhat inconsistent, there are still a couple of conclusions we can draw:

  1. Whether you will see lower performance at x8 versus x16 depends highly on the application. Some will show a difference, others won't.
  2. The higher the load on the card(s) - either from a higher resolution in games or a large number of accelerated effects in professional applications - the higher the chance of a difference in performance.

At this point, we would say that if you are using a high-end video card like the Titan X (Pascal) or possibly even a GTX 1080, it is probably a good idea to use a PCI-E 3.0 x16 slot - especially in multi-GPU configurations. Depending on the applications you use it may not make much of a difference, but the fact that we saw a 10-30% drop in performance with x8/x8 compared to x16/x16 in a couple of our tests goes to show how large a difference it can make in certain situations.

Tags: Titan X, PCIe, Performance
Nick Dedman

What about loading times for games? That is when the PCIe bus is saturated the most, since during gameplay most of the data is already in memory.

Another example of when PCIe bandwidth can help is in low-VRAM situations. I never used to be able to run the 4K textures in Shadow of Mordor with my 3GB 780 on a PCIe 2.0 based system. It was far too jerky after a few minutes of play. Upgrading to a PCIe 3.0 based system (which also has over twice the system RAM bandwidth) allows me to do this (and maintain over 60 FPS) with the same GPU/VRAM for as long as I want.

This is because Windows will swap excess video data out into system RAM over the PCIe bus.

Posted on 2016-09-09 22:13:43
Streetguru

I know you guys don't use AMD for whatever reason, but the Radeon Pro Duo would have made a lot of sense to test given that, ya know, it's a dual GPU card and all.

Posted on 2016-09-13 03:07:24
Shadownet

AMD is a whole different beast. In the past they have allowed PCIe x4 in triple-card configurations, but considering this is a test of bandwidth, I would say a similar result should be expected of their cards in those configurations, with the biggest hit probably being running 3 cards at high resolutions.

Posted on 2016-09-13 03:18:29
Streetguru

Just would have been the most extreme situation possible, especially if they used 2 of them - plus AMD cards don't use a bridge anymore either, so that's even more saturation of the lanes.

Even x4 shouldn't be too big a deal though, at least for mid-range cards maybe.

Posted on 2016-09-13 03:26:33
Shadownet

That's a very interesting result - thanks for the update; I was just reading the older one the other night. I guess bandwidth and games have finally caught up on multiple cards. The lane configuration of two x16 slots also limits the boards and CPUs, which makes it a very expensive proposition for the time being.

Posted on 2016-09-13 03:22:36
Wascally

How would a PLX chip affect the results? For example, an X99 motherboard with a 40-lane CPU and a PLX chip set up to give x16/x16 SLI but with more than 40 lanes utilized (two GPUs, multiple M.2/U.2 drives, etc.). Thanks!

Posted on 2016-09-14 05:12:19
Jay Jardin

Thanks a billion. Thanks to this I had a reason to buy a 40 lane processor. I am super happy with my machine.
https://www.facebook.com/jay.j...

Posted on 2016-10-16 19:14:46
David Keller

Awesome benchmark. I was searching for a benchmark of PCI-E speeds with Pascal GPUs. Interesting results. I'm not gonna spend $1000 on an X99 system to gain 1-2 FPS running games at 4K. If you're running 4K surround, then it may be a good idea. I'm currently at PCI-E 3.0 x8/x8 with my Z97 OC Formula, 4790K, and SLI 1080s. With all the games lately that don't take advantage of SLI, I have been wondering - since my motherboard has PCI-E on/off switches - if I might get better performance by just disabling the bottom PCI-E slot so that my top slot runs at x16 with SLI disabled. But according to the benchmarks, it doesn't seem like it would be worth the hassle.

Posted on 2016-10-29 18:54:04