We've said it before: at Puget Systems, we love data. Almost since Puget Systems was founded, we've been recording a variety of data including failure rates for every single part we sell. We use this data in a variety of different ways - to get a handle on failure trends, to help us decide which brands we want to sell, or simply to answer a customer's question on what part they should purchase. In fact, we try to publish a list of the most reliable hardware every year in our Most Reliable Hardware articles.
Today we want to take a look at some of our reliability data to answer a question that came up in one of our department meetings: are video cards getting more or less reliable? There are times when we go through a rough patch where it feels like video cards are failing left and right and we start to pine for the "good old days". Then, we remember all the problems we used to see with older cards. To get a more accurate answer to this question, we decided to examine our GPU failure logs and break down the numbers by generation.
On average, we sell just under 2000 video cards per year, so we have a fairly large sample on which to base our results, but this data may not be fully representative of the industry as a whole. One factor to take into consideration is that we tend to sell more mid and high end cards than the average PC company, which likely pushes our failure rates a bit higher than normal. We also put all of our systems through a very rigorous testing process which is specifically designed to make any card that is close to failing actually fail while it is still at our facility. Because of these two factors, our failure rate is likely higher than the industry average.
Edit 5/5/2014: We have received some questions about our testing procedure. Our entire build process (including testing) can be found here. Our phase 2 testing process is where we tend to find the most problems with video cards. It includes game benchmarks (currently X3: Terran Conflict, AvP: Alien vs Predator, and Unigine Heaven Pro 4.0) as well as what we call our "stress test". This test consists of running Prime95 and FurMark on the system at the same time to put the system under an extreme load. It is possible that FurMark will break a video card, but we only run it for ~5 minutes at a time. If a card is not able to survive a stress test for that short amount of time, we want it to break because that is a clear indication that it is not a strong card and will likely fail in the near future.
NVIDIA Failure Rates
When dealing with part failure data, we tend to separate the data into two categories: failures during our testing (when we initially build and test the machine) and failures in the field, which covers anything that fails after the machine has reached the customer. Overall, we would rather see a lower rate of failures in the field than failures during our testing, since we would prefer to isolate our customers from any problems as much as possible.
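To make the two categories concrete, here is a minimal Python sketch of how failure rates per generation could be tallied from a log. It is an illustration only, not the actual tooling behind the numbers in this article: the `CardRecord` type, the `failure_rates` function, the generation label, and the sample counts are all hypothetical, and it assumes both rates are expressed as a percentage of all cards sold in a generation.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

STAGES = ("testing", "field")


@dataclass
class CardRecord:
    """One sold card (hypothetical record layout, not a real log format)."""
    generation: str
    failed_in: Optional[str] = None  # None (no failure), "testing", or "field"


def failure_rates(records: Iterable[CardRecord]) -> Dict[str, Dict[str, float]]:
    """Return the testing and field failure rates (in percent) per generation.

    Assumption: each rate is failures in that stage divided by all cards
    sold in the generation.
    """
    sold = defaultdict(int)
    failed = defaultdict(lambda: {stage: 0 for stage in STAGES})
    for record in records:
        sold[record.generation] += 1
        if record.failed_in is None:
            continue
        if record.failed_in not in STAGES:
            raise ValueError(f"unknown failure stage: {record.failed_in!r}")
        failed[record.generation][record.failed_in] += 1
    return {
        gen: {stage: 100.0 * failed[gen][stage] / total for stage in STAGES}
        for gen, total in sold.items()
    }


# Made-up sample: 200 cards, 4 failed during testing, 2 failed in the field.
sample = (
    [CardRecord("example-gen", "testing")] * 4
    + [CardRecord("example-gen", "field")] * 2
    + [CardRecord("example-gen")] * 194
)
print(failure_rates(sample))
# {'example-gen': {'testing': 2.0, 'field': 1.0}}
```

Keeping the two stages as separate counters, rather than one combined failure count, is what lets a shift from field failures toward in-house failures show up in the results.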
One thing we want to point out is that this data is taken only from the cards we sell. Since we never sold any GT 5xx or 7xx cards (which don't actually exist right now), we could not include those cards in our failure numbers. If you want to see our failure data with only GTX video cards, you can do so here. Overall, however, the data is almost identical.
With that said, NVIDIA video cards have seen a pretty steady decline in failure rates over the last five generations, with the exception of the GTX 5xx series. With that series, NVIDIA regressed a little bit, but they more than made up for it in the GT/GTX 6xx series. The failure rate in the field for that series is especially impressive at only 1.57%.
The current generation - GTX 7xx and Titans - has so far been better than any other generation to date. The 0.64% failure rate in the field is likely a little lower than it eventually will be, since the cards have not been out for as long as the other generations, but the 1.61% failure rate during our initial testing is very, very good.
So for NVIDIA, the current cards really are more reliable than the previous generations. There is always the chance that you will get a defective card, but our records indicate that NVIDIA is certainly making a great effort to make their cards as reliable as possible.
AMD Failure Rates
Unlike NVIDIA, our failure logs show that the latest generation of AMD cards is currently seeing an increase in failure rates. The good news is that most of the Radeon R7 and R9 cards that are failing are doing so in-house, so our customers are mostly isolated from the problems. Historically, AMD has had roughly a 50-50 split between DOA and field failures, so we are actually very happy to see that ratio shifting away from failures in the field. However, a 13.46% failure rate during our testing is really, really high and indicates that there is a problem with the latest AMD video cards.
The main question is whether this is a temporary problem (like NVIDIA's GTX 5xx series) or part of a larger overall problem. Unfortunately, only time will tell.
Edit 5/5/2014: We've received some questions about what brands and models of card we used. We primarily use Asus DirectCU cards whenever possible, and the basic Asus models when there are not DirectCU versions available. It is possible that the high failure rates are limited to Asus cards, but we have used Asus as our primary supplier for video cards for a long time now. This includes NVIDIA cards as well as the Radeon HD 7xxx and Radeon HD 6xxx cards which have a much lower failure rate than the Radeon R7/R9 cards. This is a clear indication to us that Asus is not the problem, but rather something fundamental to the R7/R9 cards themselves.
Like anything, there are a number of different conclusions you can draw from this data depending on your point of view. NVIDIA, however, is pretty straightforward. Both the DOA and field failure rates have been pretty steadily improving since 2009. They had a bit of a setback with the GTX 5xx series, but their failure rates are currently at an all-time low.
AMD, however, is a bit of a mix. From the Radeon HD 4xxx series to the Radeon HD 6xxx series, AMD had an overall steady improvement in failure rates. The Radeon HD 7xxx series was pretty much the same as the generation before it, but the latest generation has shown a huge increase in failure rates. The silver lining is that the majority of the failures are of the type that we can catch in-house before the system makes it to the customer. So if you already have a Radeon R7 or R9 card that is working well, the chances are good that you will not have a problem in the future.