Introduction
Here at Puget Systems, we believe that computers should be a pleasure to purchase and own. They should get your job done, and not be a hindrance. That means it is important for the hardware we include in our workstations to both perform well (hence the testing we do in our Labs department) and also be as reliable as possible. Our product qualification process is the first step in building a reliable product line, but we also maintain records of hardware failures – both those we come across during the assembly process, as well as those reported to us by customers in the field – so that we can analyze failure rates over time. That information, in turn, can help guide our selection of products to qualify in the future.
While this is normally an internal process, sometimes we like to give our customers and readers a peek into aspects of our company that aren't usually in the public spotlight. In this article, we will look at both shop (caught in our production process) and field (after the system has shipped to the customer) failure rates of various hardware brands and models over the last three years. At the end, we will crown the most reliable component used in our workstations.
Shop Versus Field Failures
Computer hardware is very complex, and our qualification process takes place with small sample sizes – so there is always the chance that a small number of parts can be defective once we start carrying them en masse. However, one of the goals of our production process (where we build, install, and test every single system we sell) is to catch as many of these manufacturing defects as possible. This includes steps like checking for physical flaws, benchmarking the system to look for irregular performance and monitoring component temperatures under heavy load. We won't catch 100% of problem parts this way, of course, but we would much rather cause a part to prematurely fail while it is still in our shop (where we can easily replace it) rather than having it fail days, months, or even years after a customer receives their machine. Because of this, our "shop failures" include what most people would define as DOA (dead on arrival) failures, but also failures that occurred during our extensive burn-in process.
While we do our best to make the most reliable workstations possible, hardware can eventually break no matter how much time and effort we put into stress testing it. In many ways, these "field failures" that happen after we have shipped the system are more important than those we catch in the shop. If a part fails on us during the assembly process it is certainly an inconvenience, but we can usually replace it and resume production fairly quickly. When a part dies on a customer, however, it can be a very annoying process to get it replaced – even with our industry-leading support and repair department.
Some Caveats
Since every part we carry goes through a very comprehensive qualification process, our failure rates should be much lower than the industry average. There are a lot of great products out there, but also many questionable ones. This means that our relative failure rates between certain hardware groups may not match up with what you would see if you were to build your own system, or even buy a workstation from other computer manufacturers.
Additionally, we are filtering out of this data any failures that we believe were caused accidentally by our employees or customers, as well as those related to damage in shipping. The goal here is to isolate problems from the hardware itself, rather than human error.
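As a rough illustration of the filtering described above, here is a minimal Python sketch of how shop and field failure rates could be tallied while skipping failures attributed to human error or shipping. The record layout, the cause labels, and the use of units handled as the denominator are all assumptions made for this example, not a description of our internal tooling.

```python
from dataclasses import dataclass

# Hypothetical cause labels; the real record format is not described in this article.
EXCLUDED_CAUSES = {"employee_accident", "customer_accident", "shipping_damage"}


@dataclass(frozen=True)
class FailureRecord:
    part_group: str  # brand and/or product family, e.g. "Example SSD"
    stage: str       # "shop" (caught during production) or "field" (after shipping)
    cause: str       # "hardware" or one of EXCLUDED_CAUSES


def failure_rates(records, units_handled):
    """Return {part_group: {"shop": pct, "field": pct}} for hardware-caused failures.

    units_handled maps each part group to the number of units it covers
    (an assumed denominator). Every record must belong to a listed group.
    """
    counts = {group: {"shop": 0, "field": 0} for group in units_handled}
    for rec in records:
        if rec.cause in EXCLUDED_CAUSES:
            continue  # drop human error and shipping damage
        counts[rec.part_group][rec.stage] += 1
    return {
        group: {stage: 100.0 * n / units_handled[group] for stage, n in stages.items()}
        for group, stages in counts.items()
    }
```

Keeping the two stages as separate counters mirrors the shop versus field split used throughout the rest of this article.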
Lastly, to keep these graphs from being overly complex, we will be grouping components by brand and/or product family – and only including groups for which we have a large enough sample to draw meaningful conclusions. If we only sold a handful of a given item type, and it had no failures, we cannot be certain whether the part is genuinely reliable or we simply got lucky with a small sample. While we will only be covering hardware that we actively sold during 2021, we will include failure data from the last three years (2019 through 2021) when available and applicable. For example, we carried Intel's 10th Gen Core processors during the first few months of 2021 – so we will include all data on these processors from 2019 through 2021 when looking at their reliability. We didn't carry the 9th Gen Core processors at all this year, though, so they are excluded.
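To put a number on the sample size concern, one common approach (our choice for this illustration, not necessarily the cutoff used for the charts) is to ask: if we saw zero failures in a given number of units, how high could the true failure rate still plausibly be? The sketch below assumes each unit fails independently with the same probability, which is a simplification.

```python
def zero_failure_upper_bound(units: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after observing 0 failures in `units` parts.

    Solves (1 - p) ** units = 1 - confidence for p, i.e. the largest rate at
    which a clean run of this length would still occur with probability
    1 - confidence. Assumes independent, identically likely failures.
    """
    if units <= 0:
        raise ValueError("units must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / units)


for n in (10, 100, 1000):
    print(f"{n:>5} units, 0 failures: true rate could be up to "
          f"{100 * zero_failure_upper_bound(n):.2f}%")
```

With only 10 units the bound is above 25%, so a clean record says very little. At 100 units it drops to roughly 3%, and at 1,000 units to roughly 0.3%, which is why we are more comfortable drawing conclusions from the larger groups.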
Motherboards
This is one major category that we are going to gloss over rather quickly because there are far too many boards that we have carried over the course of the year with no easy and fair way to group them. We could group by brand, chipset, or CPU platform – but all of those introduce unfair complications. Motherboards are also the most complex parts of a modern computer because they are made up of so many separate components, and a failure of just one of those components means a failure for the whole board. However, there is an honorable mention I want to call out here: the Asus Prime X299 Deluxe II. This is the most recent board we've carried for Intel's Core X series, and with over 100 of them sold we have not had a single failure either in the shop or reported in the field! That is extremely rare for motherboards and probably speaks to both the maturity of the X299 chipset as well as Asus' quality control on this particular line.
Processors (CPUs)
Processors are generally one of the most reliable parts in a computer. Although internally complex, they have no moving parts and are heavily tested by manufacturers – so as long as they are kept well cooled and not pushed beyond their design specifications (overclocked) we have found them to exhibit very low failure rates. We have sufficient data to look at the following brands and lines which we carried in 2021:
- AMD Ryzen 5000 Series
- AMD Threadripper 3000 Series
- AMD Threadripper Pro 3000 Series
- Intel Core 10th Gen
- Intel Core 11th Gen
- Intel Core X 10000 Series
- Intel Xeon W 2200
- Intel Xeon Scalable 2nd Gen
Intel's Xeon W-3300 and Core 12th Gen product lines are new enough that we don't yet have enough data on them (sales numbers and/or time in the field) for it to be fair to include them in this comparison.
Intel's Xeon processors have fantastically low failure rates, with no failures at all among the W-2200 series. That is the category here which we have sold the fewest of, and therefore have the least data on, but even when combined with the Xeon Scalable family the Xeons overall had few failures here in our facility and none for customers in the field!
AMD CPUs in general had higher failure rates than Intel, but we did see an unusually high rate of failures with Intel's consumer-oriented 11th Gen processors, which is surprising next to the very low rates shown by the preceding 10th Gen.
Memory (RAM)
We carry memory from multiple manufacturers, depending on what is available and behaves well with our current motherboards, so instead of breaking this category out by brand we will split it by type:
- Normal memory
- Memory with Error Checking and Correcting (ECC)
- Registered Memory (also with ECC)
All of the memory we currently use is DDR4 and clocked at 3200MHz, though that will likely be changing in 2022 with the arrival of DDR5.
It's not too surprising that memory with ECC saw lower failure rates than normal RAM, and we've seen similar results in past articles as well. It is odd that non-Registered ECC memory shows a slightly higher field failure rate than the other categories, but it is worth noting that we sell far less of that memory type than the other two – so with a larger sample size, its rate might well fall more in line with the other two types of RAM.
Video Cards (GPUs)
As is the case with memory, we carry GeForce RTX 30 Series video cards from several manufacturers – Asus, EVGA, Gigabyte, MSI, PNY, and NVIDIA – depending on what is available. For the past couple of years, availability has been particularly limited, and we've actually had the best luck getting hold of NVIDIA's Founders Edition cards. By comparison, we've used much smaller quantities of cards from other manufacturers – so for the purposes of this analysis, we have lumped them together. This leaves us with the following four groupings for video cards:
- NVIDIA GeForce RTX 30 Series Founders Edition
- Asus, EVGA, Gigabyte, MSI, and PNY GeForce RTX 30 Series
- NVIDIA Quadro RTX Series
- NVIDIA Professional RTX A Series
That last entry may look a little odd to video card enthusiasts, and the naming doesn't really match what NVIDIA advertises those cards as, but it is due to their dropping of the "Quadro" name brand from the latest release of their professional graphics cards. These RTX A Series models, which replaced the higher-end Quadro RTX cards early in 2021, are ostensibly meant to fulfill the same role (and are certainly priced in the same way) as Quadro cards of years past, but they no longer carry any official brand name from NVIDIA.
The most noticeable issue here is a massive spike in shop failures for the Quadro RTX Series cards. This is not as bad as it looks, in a sense: it is almost entirely because of a manufacturing problem with the USB-C "VirtualLink" port on RTX 4000 video cards. All of them we received from May 2020 onward were defective, so huge swaths of our inventory of those cards failed our testing here – and for a long time after that discovery, we stopped offering them at all. Eventually we switched to simply blocking off that port and warning customers not to use it, but we still ended up considering almost 15% of the RTX 4000 cards we got to have failed, which heavily skewed the overall failure rate for that category.
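To show how a single problematic model can skew a whole category, here is a small sketch of a pooled failure rate. The unit and failure counts are invented purely for illustration (only the roughly 15% figure for the affected model echoes the text above); they are not our actual sales or failure numbers.

```python
def pooled_failure_rate(groups):
    """Pooled failure rate in percent for an iterable of (units, failures) pairs."""
    groups = list(groups)
    total_units = sum(units for units, _ in groups)
    total_failures = sum(failures for _, failures in groups)
    if total_units == 0:
        raise ValueError("no units to compute a rate from")
    return 100.0 * total_failures / total_units


# Invented counts for illustration only.
affected_model = (400, 60)  # one model with a 15% shop failure rate
other_models = (600, 6)     # rest of the category at 1%

with_affected = pooled_failure_rate([affected_model, other_models])
without_affected = pooled_failure_rate([other_models])

print(f"Category rate including the affected model: {with_affected:.1f}%")
print(f"Category rate excluding it: {without_affected:.1f}%")
```

In this made-up case the category rate goes from 1% to well over 6% because of one model, which is the same kind of effect the RTX 4000 had on the Quadro RTX grouping.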
Beyond that, it is also interesting to note that NVIDIA's GeForce RTX 30 Series Founders Edition cards are markedly more reliable than those from third-party manufacturers. Those FE cards also have unique cooling layouts in this generation, making them better suited to use in dual video card combinations. Only the "professional" RTX A Series has a lower field failure rate, but since those cards are also newer they haven't yet had as much opportunity to fail.
Storage Drives (HDDs & SSDs)
For storage drives, we are back to breaking things down by both manufacturer and product family. Over the last few years we have primarily used Western Digital hard drives, and while we've started to use Seagate as well in 2021 we don't yet have enough sold to have reliable failure rate data. On the solid-state side of things, we carry several of Samsung's offerings – but we've also been utilizing Seagate's Firecuda line of M.2 drives, and we do have enough of those under our belt to include in this analysis. That gives the following groupings:
- Western Digital Red HDDs (quiet models)
- Western Digital Ultrastar HDDs (enterprise-class)
- Samsung 870 EVO & QVO SATA SSDs
- Samsung 860 Pro SATA SSDs
- Samsung 980 Pro M.2 NVMe SSDs
- Seagate Firecuda 520 M.2 NVMe SSDs
There are three things that stand out in the chart above:
- The Western Digital hard drives we carried have similar failure rates, despite being targeted at very different users.
- Seagate's Firecuda 520 drives had a rather high rate of failures caught here at the shop, compared to Samsung SSDs at least, but so far none have been reported to us as failing in the field.
- Samsung SSDs continue to show amazingly low failure rates, both in the shop and in the field! This is another trend we have seen in years past, and it is good that it continues to be the case. With over 1,000 sold in the period we are looking at, and zero failures, the 870 EVO and QVO drives, in particular, are amazingly reliable in our experience.
Power Supplies (PSUs)
Our last category, power supplies, may also be the simplest in terms of what data we are able to present. In 2021 we primarily offered Super Flower PSUs, with EVGA as a backup when those were not available. We have carried more EVGA in years past, though, so the data on that brand goes a little further back in our records. For each brand, we are splitting them into three wattage capacities that cover the vast majority of what we carry:
- EVGA SuperNOVA 850W
- EVGA SuperNOVA 1200W
- EVGA SuperNOVA 1600W
- Super Flower LEADEX 850W
- Super Flower LEADEX 1200W
- Super Flower LEADEX 1600W
In some cases, one of these entries might include two or three different models/revisions – but we aren't going to split the chart further.
There are two stories that this data appears to be telling. First, higher wattage power supplies have an increased chance of failure in the field – which makes sense, given that they are likely to be handling much larger and possibly more sustained power loads. Second, it looks like Super Flower has a lower rate of shop failures – and while we haven't been carrying that brand as long, that should not affect the shop failure rate (time in service only affects field failures).
For both of these brands, failure rates in the field look to be very low. We also use fully modular power supplies in all of our tower workstations now, so when a failure does occur it is actually pretty easy to swap out without necessarily needing to send the whole system in for repair.
Conclusion
There are other types of hardware we use in our workstations – network interface cards, video capture devices, RAID controllers, and more – but the categories above are the main components and thus the ones for which we have the most data on failure rates. And after looking at all of that, which is the most reliable? Samsung solid-state drives, once again!
For those who have read our past articles in this series, this will likely come as no surprise. Solid-state drives have been extremely reliable since their introduction to workstation PCs over a decade ago, and Samsung in particular has done a stellar job of minimizing failure rates with the models we have carried. In fact, looking back over the entire history of Samsung SSDs in our records, we have sold over 35,000 drives and had fewer than 100 of those fail, which works out to a failure rate below 0.3%. Very impressive!