What is the most reliable hardware in our Puget Systems workstations?
Written on August 21, 2019 by Matt Bach
Here at Puget Systems, one of our top priorities is to provide the highest quality workstations possible for our customers. While the performance testing we do in our Labs department is a big part of this, the product qualification we do behind the scenes is equally important. In the end, high performance is terrific, but it means nothing if the workstation is not reliable as well.
As a part of our constant drive to offer only the highest quality components possible, we track and regularly review the failure rates for each part we carry. This is typically an internal process, but every once in a while, we like to share some of our analysis with the public. Today, we want to give a high-level view of how reliable the different types of components we use in our workstations are in terms of both their shop (failures caught in our production process) and field (failures after the system has shipped to the customer) failure rates. In addition, we want to recognize what has turned out to be the most reliable hardware used in our workstations.
We do want to note that since every single part we carry goes through a very comprehensive qualification process, our failure rates should be much lower than the industry average. There are a lot of great products out there, but also a lot of... not so great... ones. This means that our relative failure rates between certain hardware groups (like ECC vs Standard RAM and GeForce vs Quadro) likely do not match up with what you would see from other workstation manufacturers or if you were to build your own system.
Shop failure analysis
Computer hardware is amazingly complex, and even with our rigorous qualification process, there is always the chance that a part can be defective. However, one of the goals of our production process (where we build, install, and test every single system we sell) is to catch as many of these defective parts as possible. This includes the obvious actions like checking for physical defects, benchmarking the system to check for irregular performance, and ensuring that the cooling is adequate; but it also includes us deliberately trying to break things.
We don't take this to the extreme, of course, but we would much rather cause a part to fail prematurely while it is still in our shop (where we can easily replace it) than have it fail a day, month, or year after a customer receives their machine. Because of this, our "Shop Failures" include what most people would define as DOA (dead on arrival) failures, but also failures that occurred during our extensive burn-in process.
To see how reliable different groups of hardware are, we decided to pull data on a quarterly basis going back about 3 years. To help make sense of this data, we will organize it by hardware group:
CPU (Intel) - There isn't much to talk about here - processors have long been one of the most reliable parts in any computer.
Motherboard - Overall, initial motherboard reliability is trending in the right direction. From Q1 2017 to Q2 2018, the failure rate was on a steady decline. Since then, it has been a bit volatile but the average is flat overall.
RAM - We decided to chart standard and ECC (which includes both registered and non-registered models) RAM separately because there are a few very interesting data points. First of all, the initial quality of ECC RAM has been rock solid for years, and ECC RAM is one of the parts least likely to give us problems during our production process. Standard RAM has also been very good in the last few years, but it is interesting to see how much it improved in Q4 of 2017. Two things happened at this time: we moved to DDR4-2666 RAM (from DDR4-2400), and we moved to the (new at the time) Intel 8th Gen processors like the Intel Core i7 8700K. Either one of these could be the cause of the improvement, although it is likely a combination of both.
GPU - Similar to RAM, we split up the failure rate between GeForce and Quadro video cards. Starting with GeForce, we have seen a slow, but steady, increase in shop failures since about Q3 2017. We are not quite sure why this is, since we have not changed brands much during this time period, but it is an alarming trend. For Quadro, the shop failure rate has a pair of huge spikes in Q2 2017 and Q4 2018, which are almost entirely due to a couple of bad batches of Quadro P600 cards with defective HDMI or Mini DP ports.
Storage - For storage, there are two stories depending on whether the drive is a platter drive or an SSD. First, for WD platter drives, the initial quality saw a small improvement in Q1 2018 and has been stable (and terrific) since then. The bigger story (and actually what prompted us to put this post together) was the initial quality for Samsung SSDs. Simply put, even though we sell thousands of Samsung drives every year, we only ever have a handful that give us any problems during our production process.
Power Supply - Rounding out our data, power supplies have been pretty consistent over the last 3 years. We have primarily been using EVGA power supplies during this time, and it is good to see this stable reliability since inconsistent quality is a major problem we've had with other power supply brands.
Field failure analysis
While we strive to make the most reliable workstations possible, hardware can eventually break no matter how much time and effort we put into it. In many ways, these "field" failures that happen after we have shipped the system are much more important than the "shop" failures. If a part fails on us during the production process, it is an inconvenience, but we can typically replace it and resume the process fairly quickly. When a part dies on a customer, however, it can be a very annoying process to get it replaced - even with our industry-leading support and repair department.
Once again, we are charting the failure rates since Q3 2016 (the last three years). However, note that the dates shown are when we purchased and installed the part, not the date that the part failed. In other words, the older the date on the chart, the longer that part has been running in the field. Since older parts are often more susceptible to failure, this means that it is completely normal for the "field" failure rate to increase as you look further back in time.
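To make the bookkeeping concrete, here is a minimal Python sketch of how a field failure rate could be tallied by install quarter rather than by failure date. It is an illustration only, not the tooling behind the charts: the function names, the record layout, and the sample records are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def quarter_label(d: date) -> str:
    """Return a label such as 'Q3 2016' for the quarter containing d."""
    return f"Q{(d.month - 1) // 3 + 1} {d.year}"


def field_failure_rate_by_install_quarter(parts):
    """Group parts by the quarter they were installed (not the quarter they
    failed) and return the fraction of each group that failed in the field.

    parts: iterable of (install_date, failed_in_field) tuples.
    """
    installed = defaultdict(int)
    failed = defaultdict(int)
    for install_date, failed_in_field in parts:
        label = quarter_label(install_date)
        installed[label] += 1
        if failed_in_field:
            failed[label] += 1
    return {label: failed[label] / installed[label] for label in installed}


# Made-up records for illustration only; these are not real failure data.
sample_parts = [
    (date(2016, 8, 15), True),
    (date(2016, 9, 1), False),
    (date(2019, 5, 20), False),
    (date(2019, 6, 3), False),
]
rates_by_quarter = field_failure_rate_by_install_quarter(sample_parts)
```

Because each part is counted under its install quarter, older quarters have had more time to accumulate failures, which is why rates computed this way tend to rise toward the left side of a chart.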
With that noted, let's examine each category individually:
CPU (Intel) - While Intel CPUs have been very reliable over the last two years, it is interesting to see a spike in failures on systems that are about three years old. Looking at the data, this increase in failures appears to be primarily driven by the fact that three years ago we were using the Broadwell-based Intel X-series CPUs (such as the Core i7 6900K), which have a higher failure rate than the newer models.
Motherboard - Here, we see the kind of trend that you might expect from computer hardware in general. As we look further back in time, the failure rate increases at a steady rate. What this means is that the older your system is, the more likely your motherboard is to develop issues.
RAM - Interestingly, there isn't a huge difference in reliability between standard and ECC RAM over the three-year period we charted. ECC RAM is very slightly more reliable, but not by much. In either case, it appears that the age of the RAM only has a minor impact on its reliability.
GPU - For video cards, the reliability is quite a bit different for GeForce and Quadro cards. On the Quadro side, the cards are extremely reliable for the first year, but the failure rate rises sharply for cards installed around Q2 2018 or earlier. This is interesting because we have primarily used the Quadro P-series all the way back to the end of 2016, so it was not a matter of a new product line changing the quality of the cards. For GeForce, however, the reliability has actually been getting a bit worse recently, with an uptick in failure rate in the last year. This is a worrying trend since it means that the newer GeForce cards are having problems at a higher rate than 2-3 year old cards.
Storage - For the WD platter drives, we are seeing only a very small increase in failures over time. Samsung SSDs have a few failures if you go all the way back to 2016, but otherwise have been nearly perfect in terms of reliability.
Power Supply - The field failure rate for power supplies over the last three years is very similar to motherboards - as the power supply gets older, it becomes more and more likely to fail.
Another way we can look at this data is to group the reliability by year and stick all of the results onto a single chart. This doesn't give us quite the fine detail of our above charts, but it helps to give a clearer look at how the reliability of each hardware type changes depending on the age of the system.
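A rough Python sketch of that kind of regrouping follows, assuming per-quarter counts of installed and failed parts are available. The function name, the data layout, the choice of Q3 2019 as the reference quarter, and every number in the sample are assumptions made for illustration.

```python
from collections import defaultdict


def failure_rate_by_age_year(quarterly_counts, reference=(2019, 3)):
    """Roll per-quarter counts up into age buckets measured in whole years.

    quarterly_counts: dict mapping (year, quarter) -> (installed, failed).
    reference: the (year, quarter) treated as "now"; an assumption here.
    Bucket 0 covers the four quarters just before the reference quarter,
    bucket 1 the four before that, and so on.
    """
    ref_index = reference[0] * 4 + (reference[1] - 1)
    installed = defaultdict(int)
    failed = defaultdict(int)
    for (year, quarter), (n_installed, n_failed) in quarterly_counts.items():
        quarters_old = ref_index - (year * 4 + (quarter - 1))
        bucket = (quarters_old - 1) // 4
        installed[bucket] += n_installed
        failed[bucket] += n_failed
    return {bucket: failed[bucket] / installed[bucket] for bucket in installed}


# Placeholder counts for illustration only; these are not real failure data.
sample_counts = {
    (2019, 2): (100, 1),  # under a year old
    (2018, 3): (100, 3),  # under a year old
    (2017, 1): (50, 5),   # between two and three years old
}
rates_by_age = failure_rate_by_age_year(sample_counts)
```

Pooling the counts before dividing, rather than averaging the quarterly rates, keeps a quarter with very few installs from skewing its year.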
Looking at the data this way reveals some very interesting information. First, if you have a relatively new system (less than a year old), by far the most likely component to break is an NVIDIA GeForce card. As far as what is least likely to develop problems on a fairly new system, Samsung SSDs are very reliable with only a single drive having failed in the field, but Quadro GPUs take the cake with zero failures within this time period.
However, as the system gets older, power supplies, motherboards, and to a slightly lesser extent Quadro GPUs all decrease in reliability. By the three-year mark, you are more likely to encounter a problem with your motherboard or PSU than with anything else. On the positive side, Samsung SSDs fare the best over this three-year period, followed by ECC RAM and standard RAM.
While the failure rate for many types of components changes depending on the age of the system, if we simply sum up the failure rates for each hardware group, we get a great idea of the overall reliability for each type of component.
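One plausible reading of "sum up" is adding each group's shop and field failure rates into a single figure and then ranking the groups. The short Python sketch below shows that reading; the function name, the group labels, and all of the rates are placeholders, and the exact combination used for the chart is not stated here.

```python
def overall_failure_rates(shop_rates, field_rates):
    """Add shop and field failure rates per hardware group.

    Both arguments map a group name to a failure rate (a fraction).
    Summing the two is an assumed interpretation, not a confirmed method.
    """
    return {group: shop_rates[group] + field_rates[group] for group in shop_rates}


# Placeholder rates for illustration only; these are not real failure data.
shop_rates = {"Part A": 0.010, "Part B": 0.002}
field_rates = {"Part A": 0.005, "Part B": 0.004}

overall = overall_failure_rates(shop_rates, field_rates)
# Most reliable group first (lowest combined failure rate).
ranked_groups = sorted(overall, key=overall.get)
```

Under this reading, a group with a good field record can still rank poorly overall if it has a high shop failure rate.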
Looking at the data this way, there are a number of things that stand out. First, the fact that WD platter hard drives are as reliable as Intel CPUs over a three-year period was unexpected. Second, while Quadro GPUs are more reliable in the field than GeForce cards, a couple of bad batches of Quadro P600 cards with defective video ports means that, as a whole, Quadro has been less reliable than GeForce for us.
Since the title of this post is "What is the most reliable hardware in our Puget Systems workstations?", however, let's go ahead and answer that question:
Whether you are looking at initial reliability or reliability over time, it is clear that the Samsung SSDs are easily the most reliable hardware we have used in our workstations over the past three years.
Keep in mind that we do not restrict our customers to enterprise-class drives or anything like that - most of what we use are the SATA and NVMe-based consumer EVO and PRO product lines. And yet, they are ~50% more reliable than ECC RAM (where reliability is the entire point), and 3x more reliable than Intel CPUs.
One thing we do want to make clear is that this DOES NOT mean you don't need to back up your data if you are using a Samsung SSD. Yes, the reliability is excellent, but there is always the chance that the drive can fail. In addition, a reliable drive doesn't protect you against malware, viruses, lightning strikes, or simply accidentally deleting something you didn't mean to. Your data is far more valuable than any piece of hardware in your computer, and you should always take active measures to protect it.