Video Card Failure Rates by Generation

Written on May 2, 2014 by Matt Bach
Table of Contents:
  1. Introduction
  2. NVIDIA Failure Rates
  3. AMD Failure Rates
  4. Conclusion

Introduction

We've said it before: at Puget Systems, we love data. Almost since Puget Systems was founded, we've been recording a variety of data including failure rates for every single part we sell. We use this data in a variety of different ways - to get a handle on failure trends, to help us decide which brands we want to sell, or simply to answer a customer's question on what part they should purchase. In fact, we try to publish a list of the most reliable hardware every year in our Most Reliable Hardware articles.

Today we want to take a look at some of our reliability data to answer a question that came up in one of our department meetings: are video cards getting more or less reliable? There are times when we go through a rough patch where it feels like video cards are failing left and right and we start to pine for the "good old days". Then we remember all the problems we used to see with older cards. To get a more accurate answer to this question, we decided to examine our GPU failure logs and break down the numbers by generation.

On average, we sell just under 2,000 video cards per year, so we have a fairly large sample size on which to base our results, but this data may not be fully representative of the industry as a whole. One factor to take into consideration is that we tend to sell more mid-range and high-end cards than the average PC company. In addition, we put all of our systems through a very rigorous testing process that is specifically designed to make any card that is close to failing actually fail while it is still at our facility. Because of these two factors, our failure rates are likely higher than the industry average.
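To give a rough feel for how much sample size matters when reading a failure rate, here is a minimal sketch using a Wilson score interval. This is not part of our own analysis; the function name, the 95% confidence level, and the card counts are all assumptions chosen purely for illustration.

```python
import math


def wilson_interval(failures, cards, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    if cards <= 0:
        raise ValueError("cards must be positive")
    p = failures / cards
    denom = 1.0 + z * z / cards
    center = (p + z * z / (2 * cards)) / denom
    half = z * math.sqrt(p * (1 - p) / cards + z * z / (4 * cards * cards)) / denom
    # Clamp to [0, 1] to absorb floating point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts (not real data): the same 5% observed failure
# rate measured on 2000 cards and on only 20 cards.
large = wilson_interval(100, 2000)
small = wilson_interval(1, 20)
print("2000 cards: %.1f%% to %.1f%%" % (100 * large[0], 100 * large[1]))
print("  20 cards: %.1f%% to %.1f%%" % (100 * small[0], 100 * small[1]))
```

The interval for the small batch comes out far wider than the one for the large batch, which is why a rate measured on a handful of cards should be read with caution.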

Edit 5/5/2014: We have received some questions about our testing procedure. Our entire build process (including testing) can be found here. Our phase 2 testing process is where we tend to find the most problems with video cards. It includes game benchmarks (currently X3: Terran Conflict, AvP: Alien vs Predator, and Unigine Heaven Pro 4.0) as well as what we call our "stress test", which consists of running Prime95 and Furmark on the system at the same time to put it under an extreme load. It is possible that Furmark will break a video card, but we only run it for ~5 minutes at a time. If a card cannot survive a stress test for that short amount of time, we want it to break, because that is a clear indication that it is not a strong card and will likely fail in the near future.

NVIDIA Failure Rates

When dealing with part failure data, we tend to separate the data into two categories: failures during our testing (when we initially build and test the machine) and failures in the field, which covers anything that fails after the machine has reached the customer. Overall, we would rather see a lower rate of failures in the field than failures during our testing, since we would prefer to isolate our customers from any problems as much as possible.
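As a minimal sketch of how the two categories can be tallied per generation, the snippet below computes both rates from a simple log. The log format, the generation labels, and every count are hypothetical, and the choice to divide both rates by the total number of cards is an assumption rather than a description of our actual tooling.

```python
from collections import defaultdict

# Hypothetical log rows: (generation, outcome), where outcome is "ok",
# "testing" (failed in-house), or "field" (failed at the customer).
# These rows are invented for illustration and are not real failure data.
SAMPLE_LOG = (
    [("Gen A", "ok")] * 95
    + [("Gen A", "testing")] * 3
    + [("Gen A", "field")] * 2
    + [("Gen B", "ok")] * 46
    + [("Gen B", "testing")] * 3
    + [("Gen B", "field")] * 1
)


def failure_rates(log):
    """Return testing and field failure percentages per generation."""
    counts = defaultdict(lambda: {"cards": 0, "testing": 0, "field": 0})
    for generation, outcome in log:
        entry = counts[generation]
        entry["cards"] += 1
        if outcome in ("testing", "field"):
            entry[outcome] += 1
    # Assumption: both rates use the total card count as the denominator.
    return {
        generation: {
            "testing_pct": 100.0 * c["testing"] / c["cards"],
            "field_pct": 100.0 * c["field"] / c["cards"],
        }
        for generation, c in counts.items()
    }


rates = failure_rates(SAMPLE_LOG)
for generation, r in sorted(rates.items()):
    print("%s: %.2f%% in testing, %.2f%% in the field"
          % (generation, r["testing_pct"], r["field_pct"]))
```

Keeping the two percentages separate, rather than summing them into one number, is what lets a generation with many in-house failures but few field failures be told apart from the reverse.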

One thing we want to point out is that this data is taken only from the cards we sell. Since we never sold any GT 5xx or 7xx (which don't actually exist right now), we could not include those cards in our failure numbers. If you want to see our failure data with only GTX video cards, you can do so here. Overall, however, the data is almost identical.

With that said, NVIDIA video cards have seen a pretty steady decline in failure rates over the last five generations, with the exception of the GTX 5xx series. With that series, NVIDIA regressed a little bit, but they more than made up for it in the GT/GTX 6xx series. The failure rate in the field for that series is especially impressive at only 1.57%.

The current generation - GTX 7xx and Titans - has so far been better than any other generation to date. The 0.64% failure rate in the field is likely a little lower than it eventually will be, since the cards have not been out for as long as the other generations, but the 1.61% failure rate during our initial testing is very, very good.

So for NVIDIA, the current cards really are more reliable than the previous generations. There is always the chance that you will get a defective card, but our records indicate that NVIDIA is certainly making a great effort to make their cards as reliable as possible.

AMD Failure Rates

Unlike NVIDIA, our failure logs show that the latest generation of AMD cards is currently seeing an increase in failure rates. The good news is that most of the Radeon R7 and R9 cards that are failing are doing so in-house, so our customers are mostly isolated from the problems. Historically, AMD has about a 50-50 DOA to field failure rate, so we are actually very happy to see that ratio shifting away from failures in the field. However, a 13.46% failure rate during our testing is really, really high and indicates that there is a problem with the latest AMD video cards.
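The split between in-house and field failures can be expressed as a single share. The sketch below is only an illustration: the function name is made up, and it assumes both percentages are measured against the same number of cards.

```python
def in_house_share(testing_pct, field_pct):
    """Fraction of all failures that happened during in-house testing."""
    total = testing_pct + field_pct
    if total == 0:
        raise ValueError("no failures recorded")
    return testing_pct / total


# GTX 7xx / Titan figures quoted in this article:
# 1.61% in testing and 0.64% in the field.
print(round(in_house_share(1.61, 0.64), 3))  # prints 0.716

# A 50-50 split: any two equal rates give a share of one half.
# The 5.0 values are placeholders, not measured rates.
print(in_house_share(5.0, 5.0))  # prints 0.5
```

A share above 0.5 means more failures are caught before the system ships than after, which is the direction we want that ratio to move.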

The main question is whether this is a temporary problem (like NVIDIA's GTX 5xx series) or part of a larger overall problem. Unfortunately, only time will tell.

Edit 5/5/2014: We've received some questions about what brands and models of card we used. We primarily use Asus DirectCU cards whenever possible, and the basic Asus models when there are not DirectCU versions available. It is possible that the high failure rates are limited to Asus cards, but we have used Asus as our primary supplier for video cards for a long time now. This includes NVIDIA cards as well as the Radeon HD 7xxx and Radeon HD 6xxx cards which have a much lower failure rate than the Radeon R7/R9 cards. This is a clear indication to us that Asus is not the problem, but rather something fundamental to the R7/R9 cards themselves.

Conclusion

Like anything, there are a number of different conclusions you can draw from this data depending on your point of view. NVIDIA, however, is pretty straightforward. Both the DOA and in-the-field failure rates have been pretty steadily improving since 2009. They had a bit of a setback with the GTX 5xx series, but their failure rates are currently at an all-time low.

AMD, however, is a bit of a mix. From the Radeon HD 4xxx series to the Radeon HD 6xxx series, AMD had an overall steady improvement in failure rates. The Radeon HD 7xxx series was pretty much the same as the generation before it, but the latest generation has shown a huge increase in failure rates. The silver lining is that the majority of the failures are of the type that we can catch in-house before the system makes it to the customer. So if you already have a Radeon R7 or R9 card that is working well, the chances are good that you will not have a problem in the future.

Tags: GPU, video card, NVIDIA, AMD, failure rate
SAimNE

by the looks of the graph i would question if it's actually a design flaw or if amd just really upped its standards for a passing card.... possibly since they were backed into a corner financially they needed as little bad press as possible and thus made a HUGE renovation to their stress testing pass requirements o.o

Posted on 2014-05-03 01:57:34
Boogie Man

Any chance of pinpointing the mode of failure? I would guess most of the modern card failures were due to VRM failure or Fan failure.
I would imagine it is easy to guess why the NVIDIA reliability declined with the GTX 5xx series which were generally less over-engineered than their GTX 4xx counterparts due to the leveraging of the transistor improvements for power savings. In particular was the GTX 570 which had issues with the 4 phase VRMs.
The GTX 6xx and 7xx series were much more power efficient in addition to having more robust VRMs thus I can easily see why reliability is so good.

As for AMD's high failure rate.
There have been reports of the ASUS Direct CU cooler being slightly sub-optimal for the R9-270X and R9-290/X series.
I have personally seen the early ASUS R9-270X having less auxiliary component cooling than the competitors (no active RAM cooling, VRMs have no heatsinks etc). Whether this is a problem I can't say because the ASUS electronics are pretty robust.

The R9-290X ASUS cooler seems to be originally designed for the 780Ti so there might be weaker core cooling performance but still far superior to the Reference. However, the reference cooler does have superior VRM cooling to most custom designs to date.

I would be very interested as to the mode of failure on those AMD cards. The core is supposedly able to take a lot more heat but I'm not so sure about the VRMs that have to supply the massive power requirements. It might not even be a heat issue, a lot of watercooled GTX 570s died in the past because the 4 phase VRMs could not safely supply the current required when the TDP was exceeded.

Posted on 2014-05-06 00:47:53
NoldorElf

"We've received some questions about what brands and models of card we used. We primarily use Asus DirectCU cards whenever possible, and the basic Asus models when there are not DirectCU versions available. It is possible that the high failure rates are limited to Asus cards, but we have used Asus as our primary supplier for video cards for a long time now."

Matt, I may have an explanation then for the R9 290s failing. The Asus Direct CU II versions of the R9 290 have been known to have very poor VRM cooling. The DCU II cooler was the same as the one on the 780, only the PCB layout on the 290 changed substantially compared to that of the GTX 780, resulting in inadequate coverage of the core. More dangerously, the VRMs were known to overheat at times.

What kinds of temperatures are you getting on your core and VRMs at load (please indicate room temp, load temp, and benchmark used)?

The other reason may simply be that the cards themselves, the R9 series run very hot. They use a lot of electricity - not as much as Fermi, but approaching that level.

Posted on 2014-06-28 16:37:23

We've started to use some XFX cards to determine if the high failure rates were an AMD or Asus issue, and so far our first batch of XFX cards was only slightly better than the Asus cards. This was just a batch of 10 R9 280X cards though, and they haven't even made it through our full assembly process, so it is really too early to tell if using a different brand than Asus is going to help or not.

I'm not sure about exact VRM temperatures since we don't log that currently, but I believe we usually see about 85C being reported when we run a combination of Prime95 and Furmark (our ambient is about 20-21C). The core temperature is usually really high under load (about 94-95C) but there isn't too much we can do about that. Asus has the fan ramping set to not ramp up much until the card hits around 94C so that is simply what the card is designed to run at during load. So even if we really overdo it on chassis airflow, the GPU fan is simply going to ramp way down and still let the core get to 95C.

If I had to make a guess, I would think that the biggest contributor to the high failure rates is simply the fact that the R9 cards run really hot. It doesn't even seem to be a question of how much electricity they use, but rather that the fan profiles allow the cards to get really hot.

Posted on 2014-06-30 19:31:15
Casecutter

That’s a very "simpletons assessment" to conclude anything, especially to state it is "fundamental to (ALL) the R7/R9". This more steers me to doubt Asus and their ability to "discern" in their final testing what issues might plague what they send out the door.

It's very short-sighted to conclude it's any "indication" of the AMD GPU, without some further drill-down to the actual issues. The chance that the GPU "chip" is defective is very low (as both AMD and Asus will dyno/check and sort prior to surface mounting to the PCB); more often something that "immediate" would indicate a sub-system issue on the PCB Asus engineers.
Or...
This kind of "advertising" almost smells like some sort of "Tier 0" stunt.

Posted on 2014-10-01 18:24:53
Samuel

I had a r9 280x ... and it did much Blue Screens. So yes. I guess, they are right with the high failure rate. Beside that the drivers are very very shitty, i would not buy a AMD ever again.

Posted on 2015-01-13 08:51:42
Khalifa Aouameur

hi i wanna buy an msi r9 290 gaming so i wanna know if those failure rates are high and if i can fix the failures or not?! like black screens, artifacts, maybe the fan stopping or high temps
please help me, i am confused and afraid to buy. i hope we can fix those failures easily?

Posted on 2015-01-16 02:03:02
Alcide Ford Jr

I think the thing everyone is overlooking is that Nvidia cards can't mine digital coins. The Bitcoin rush had miners pushing AMD cards to the max for hours on end.

Posted on 2015-01-25 15:46:46
Rafael Luik

Those are brand new cards coming from the factory / distributor, they were not used.

Posted on 2015-05-18 06:22:32
Prince Chèn

Ironic that I haven't had an NVIDIA GPU since 2008 and had not even one problem with an AMD one, and my first new NVIDIA GPU which is part of their new line up, is showing signs of failure while being certified by EVGA. I didn't buy it new however, but still it stands, it was refurbished by the company that makes the graphics card, to make it reliable as a new one. I could have bought a new one, but I expected Manufacturer refurbished to be basically the same as new.

Posted on 2016-02-27 04:04:14
TipTop

That's what you get for buying cheap crap like AMD.

Posted on 2016-07-29 09:46:46