What is the most reliable hardware in our Puget Systems workstations?

Written on August 21, 2019 by Matt Bach

Introduction

Here at Puget Systems, one of our top priorities is to provide the highest quality workstations possible for our customers. While the performance testing we do in our Labs department is a big part of this, the product qualification we do behind the scenes is equally important. In the end, high performance is terrific, but it means nothing if the workstation is not reliable as well.

As a part of our constant drive to offer only the highest quality components possible, we track and regularly review the failure rates for each part we carry. This is typically an internal process, but every once in a while, we like to share some of our analysis with the public. Today, we want to give a high-level view of how reliable the different types of components we use in our workstations are in terms of both their shop (failures caught in our production process) and field (failures after the system has shipped to the customer) failure rates. In addition, we want to recognize what has turned out to be the most reliable hardware used in our workstations.
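As a rough illustration of how this kind of tracking can be tallied, here is a minimal sketch in Python. The parts log, category names, and values below are entirely hypothetical; the actual data and tooling behind this article are internal and are not shown here.

```python
from collections import defaultdict

# Hypothetical parts log: one record per part installed in a system.
# "shop" = failed during production/burn-in, "field" = failed after shipping,
# None = no failure recorded.
parts = [
    {"category": "SSD (Samsung)", "quarter": "2018-Q3", "failure": None},
    {"category": "GPU (GeForce)", "quarter": "2018-Q3", "failure": "shop"},
    {"category": "Power Supply",  "quarter": "2019-Q1", "failure": "field"},
    {"category": "SSD (Samsung)", "quarter": "2019-Q1", "failure": None},
]

def failure_rate_by_category(parts, kind):
    """Fraction of installed parts in each category with the given failure kind."""
    installed = defaultdict(int)
    failed = defaultdict(int)
    for part in parts:
        installed[part["category"]] += 1
        if part["failure"] == kind:
            failed[part["category"]] += 1
    return {cat: failed[cat] / installed[cat] for cat in installed}

print(failure_rate_by_category(parts, "shop"))   # DOA / burn-in failures
print(failure_rate_by_category(parts, "field"))  # failures after shipping
```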


We do want to note that since every single part we carry goes through a very comprehensive qualification process, our failure rates should be much lower than the industry average. There are a lot of great products out there, but also a lot of... not so great... ones. This means that our relative failure rates between certain hardware groups (like ECC vs Standard RAM and GeForce vs Quadro) likely do not match up with what you would see from other workstation manufacturers or if you were to build your own system.

Shop failure analysis

Computer hardware is amazingly complex, and even with our rigorous qualification process, there is always the chance that a part can be defective. However, one of the goals of our production process (where we build, install, and test every single system we sell) is to catch as many of these defective parts as possible. This includes the obvious actions like checking for physical defects, benchmarking the system to check for irregular performance, and ensuring that the cooling is adequate; but it also includes us deliberately trying to break things.

We don't take this to the extreme, of course, but we would much rather cause a part to prematurely fail while it is still in our shop (where we can easily replace it) rather than having it fail a day, month, or year after a customer receives their machine. Because of this, our "Shop Failures" include what most people would define as DOA (dead on arrival) failures, but also failures that occurred during our extensive burn-in process.

[Chart: Puget Systems shop hardware failure rates]

To see how reliable different groups of hardware are, we decided to pull data on a quarterly basis going back about 3 years. To help make sense of this data, we will organize it by hardware group:

CPU (Intel) - There isn't much to talk about here - processors have long been one of the most reliable parts in any computer.

Motherboard - Overall, initial motherboard reliability is trending in the right direction. From Q1 2017 to Q2 2018, the failure rate was on a steady decline. Since then, it has been a bit volatile but the average is flat overall.

RAM - We decided to chart standard and ECC (which includes both registered and non-registered models) RAM separately because there are a few very interesting data points. First of all, the initial quality of ECC RAM has been rock solid for years and is one of the least likely parts to give us problems during our production process. Standard RAM has also been very good in the last few years, but it is interesting to see how much it improved in Q4 of 2017. Two things happened at this time: We moved to DDR4-2666 RAM (from DDR4-2400), and we moved to the (new at the time) Intel 8th Gen processors like the Intel Core i7 8700K. Either one of these could be the cause of the improvement, although it is likely a combination of both.

GPU - Similar to RAM, we split up the failure rate between GeForce and Quadro video cards. Starting with GeForce, we have seen a slow but steady increase in shop failures since about Q3 2017. We are not quite sure why this is, since we have not changed brands much during this time period, but it is an alarming trend. For Quadro, the shop failure rate has a pair of huge spikes in Q2 2017 and Q4 2018, which are almost entirely due to a couple of bad batches of Quadro P600 cards with defective HDMI or Mini DP ports.

Storage - For storage, there are two stories depending on whether the drive is a platter drive or an SSD. First, for WD platter drives the initial quality saw a small improvement in Q1 2018 and has been stable (and terrific) since then. The bigger story (and actually what prompted us to put this post together) is the initial quality of Samsung SSDs. Simply put, even though we sell thousands of Samsung drives every year, we only ever have a handful that give us any problems during our production process.

Power Supply - Rounding out our data, power supplies have been pretty consistent over the last 3 years. We have primarily been using EVGA power supplies during this time, and it is good to see this stable reliability since inconsistent quality is a major problem we've had with other power supply brands.

Field failure analysis

While we strive to make the most reliable workstations possible, hardware can eventually break no matter how much time and effort we put into it. In many ways, these "field" failures that happen after we have shipped the system are much more important than the "shop" failures. If a part fails on us during the production process, it is an inconvenience, but we can typically replace it and resume the process fairly quickly. When a part dies on a customer, however, it can be a very annoying process to get it replaced - even with our industry-leading support and repair department.

[Chart: Puget Systems field hardware failure rates]

Once again, we are charting the failure rates since Q3 2016 (the last three years). However, note that the dates shown are when we purchased and installed the part, not the date that the part failed. In other words, the older the date on the chart, the longer that part has been running in the field. Since older parts are often more susceptible to failure, this means that it is completely normal for the "field" failure rate to increase as you look further back in time.
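To make that install-date bucketing concrete, here is a small hypothetical sketch; the quarters, counts, and failures below are invented purely for the example and do not reflect our actual numbers.

```python
from collections import defaultdict

# Hypothetical data: parts installed per quarter, plus individual field failures.
installed_per_quarter = {"2016-Q4": 400, "2017-Q4": 550, "2018-Q4": 700}
field_failures = [
    {"installed": "2016-Q4", "failed": "2018-Q1"},
    {"installed": "2016-Q4", "failed": "2019-Q2"},
    {"installed": "2018-Q4", "failed": "2019-Q3"},
]

# Each failure is charged to the quarter the part was installed, not the quarter
# it failed - which is why older quarters naturally show higher field failure rates.
failed = defaultdict(int)
for failure in field_failures:
    failed[failure["installed"]] += 1

for quarter in sorted(installed_per_quarter):
    rate = failed[quarter] / installed_per_quarter[quarter]
    print(f"{quarter}: {rate:.2%} field failure rate")
```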

With that noted, let's examine each category individually:

CPU (Intel) - While Intel CPUs have been very reliable over the last two years, it is interesting to see a spike in failures on systems that are about three years old. Looking at the data, this increase in failures appears to be primarily driven by the fact that we were using the Broadwell-based Intel X-series CPUs (such as the Core i7 6900K) three years ago, which have a higher failure rate than the newer models.

Motherboard - Here, we see the kind of trend that you might expect from computer hardware in general. As we look further back in time, the failure rate increases at a steady rate. What this means is that the older your system is, the more likely your motherboard is to develop issues.

RAM - Interestingly, there isn't a huge difference in reliability between standard and ECC RAM over the three-year period we charted. ECC RAM is slightly more reliable, but not by much. In either case, it appears that the age of the RAM has only a minor impact on its reliability.

GPU - For video cards, the reliability is quite a bit different for GeForce and Quadro cards. On the Quadro side, the cards are extremely reliable for the first year, but there is a sharp rise in failure rate starting around Q2 2018. This is interesting because we have primarily used the Quadro P-series going all the way back to the end of 2016, so it was not a matter of a new product line changing the quality of the cards. For GeForce, however, the reliability has actually been getting a bit worse recently, with an uptick in failure rate in the last year. This is a worrying trend since it means that the newer GeForce cards are having problems at a higher rate than 2-3 year old cards.

Storage - For the WD platter drives, we are seeing a very small increase in failures over time, but it isn't by very much. Samsung SSDs have a few failures if you go all the way back to 2016, but otherwise have been nearly perfect in terms of reliability.

Power Supply - The field failure rate for power supplies over the last three years is very similar to motherboards - as the power supply gets older, it becomes more and more likely to fail.

Another way we can look at this data is to group the reliability by year and stick all of the results onto a single chart. This doesn't give us quite the fine detail of our above charts, but it helps to give a clearer look at how the reliability of each hardware type changes depending on the age of the system.
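A sketch of that regrouping, assuming per-quarter field failure rates are already available; the rates below are made up, and only the age-bucketing logic is the point.

```python
# Hypothetical quarterly field failure rates for one component type,
# keyed by the quarter the parts were installed.
quarterly_rates = {
    "2019-Q2": 0.001, "2018-Q4": 0.004, "2018-Q1": 0.006,
    "2017-Q2": 0.010, "2016-Q4": 0.013,
}

def age_in_years(installed, now="2019-Q3"):
    """Whole years between the install quarter and 'now' (0 = under a year old)."""
    year_i, q_i = installed.split("-Q")
    year_n, q_n = now.split("-Q")
    quarters_old = (int(year_n) - int(year_i)) * 4 + (int(q_n) - int(q_i))
    return quarters_old // 4

# Group the quarterly rates into age buckets and average within each bucket.
by_age = {}
for quarter, rate in quarterly_rates.items():
    by_age.setdefault(age_in_years(quarter), []).append(rate)

for age in sorted(by_age):
    avg = sum(by_age[age]) / len(by_age[age])
    print(f"~{age}-{age + 1} years old: {avg:.2%} average field failure rate")
```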

[Chart: Hardware reliability over time]

Looking at the data this way reveals some very interesting information. First, if you have a relatively new system (less than a year old), by far the most likely component to break is an NVIDIA GeForce card. As far as what is least likely to develop problems on a fairly new system, Samsung SSDs are very reliable with only a single drive having failed in the field, but Quadro GPUs take the cake with zero failures within this time period.

However, as the system gets older, power supplies, motherboards, and to a slightly lesser extent Quadro GPUs all decrease in reliability. By the three-year mark, you are more likely to encounter a problem with your motherboard or PSU than with anything else. On the positive side, Samsung SSDs fare the best over this three-year period, followed by ECC RAM and standard RAM.

Conclusion

While the failure rate for many types of components changes depending on the age of the system, if we simply sum up the failure rates for each hardware group, we get a great idea of the overall reliability for each type of component.

[Chart: Puget Systems overall hardware failure rates]

Looking at the data this way, there are a number of things that stand out. First, the fact that WD platter hard drives are as reliable as Intel CPUs over a three-year period was unexpected. Second, while Quadro GPUs are more reliable in the field than GeForce cards, a couple of bad batches of Quadro P600 cards with defective video ports means that, as a whole, Quadro has been less reliable than GeForce for us.

Since the title of this post is "What is the most reliable hardware in our Puget Systems workstations?", however, let's go ahead and answer that question:

Whether you are looking at initial reliability or reliability over time, it is clear that the Samsung SSDs are easily the most reliable hardware we have used in our workstations over the past three years.

Most Reliable Hardware in Puget Systems Workstations: Samsung SSDs

Keep in mind that we do not restrict our customers to enterprise-class drives or anything like that - most of what we use are the SATA and NVMe-based consumer EVO and PRO product lines. And yet, they are ~50% more reliable than ECC RAM (whose entire point is reliability) and 3x more reliable than Intel CPUs.
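Since those comparisons are stated only in relative terms, here is how the arithmetic reads with made-up absolute numbers; this is purely illustrative and the real rates are not published in this article.

```python
# Purely illustrative numbers - the article states only relative reliability.
ssd_rate = 0.004                # hypothetical 3-year SSD failure rate

ecc_ram_rate = ssd_rate * 1.5   # "~50% more reliable" -> ECC RAM fails ~1.5x as often
cpu_rate = ssd_rate * 3         # "3x more reliable"   -> CPUs fail ~3x as often

print(f"SSD: {ssd_rate:.2%}  ECC RAM: {ecc_ram_rate:.2%}  CPU: {cpu_rate:.2%}")
```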

One thing we do want to make clear is that this DOES NOT mean you don't need to back up your data if you are using a Samsung SSD. Yes, the reliability is excellent, but there is always the chance that the drive can fail. In addition, a reliable drive doesn't protect you against malware, viruses, lightning strikes, or simply accidentally deleting something you didn't mean to. Your data is far more valuable than any piece of hardware in your computer, and you should always take active measures to protect it.

Tags: Hardware, Reliability
Jakub Badełek

This is quite a fascinating article! I wonder how it's gonna look in the next two years ;)

BTW, I hope you won't get angry - I guess you hear this question a lot - but when will we see a new LR test? :P We got new processors from AMD which might mix things up, and in addition LR threw out an update claiming a performance boost! ;) The previous article based on consumer processors is a little old - just saying ;)

Posted on 2019-08-22 09:17:14

LR is way up there - we've just been pretty swamped with all the recent hardware launches. I also still don't have any great ideas for how to test things like brush performance, and that kind of thing is something I really, really want to test. If I can't figure it out soon, though, we'll have to go back to just testing things like import/export/etc. Better than nothing at least.

Posted on 2019-08-22 13:21:59
Jakub Badełek

What about framerate capture software used for measuring FPS in games? (For example, Fraps.) I know it also measures FPS in software other than games, like video players etc. Maybe one of them, not necessarily Fraps, would measure the FPS of brush animations in LR. Just a wild idea. Thanks for the response btw! :-)

Posted on 2019-08-22 17:11:41
Jakub Badełek

Matt, I just checked that... Fraps DOES show the number of FPS when using a brush, or anything else that modifies the image in Lightroom... I know that some software allows measuring frame times and delays etc. Maybe worth considering ;) Check out Fraps, it's freeware.

Posted on 2019-08-22 19:31:45

Interesting - it has been a while since I tried that, because I couldn't get it to work on anything like LR a few years back. If they fixed that, though, it definitely could be a method we could use. I have a few ideas I'm hoping we can pursue directly with the LR devs that should end up working even better if we can convince them to dedicate resources to it, but Fraps could be a good backup.

Thanks for the heads up!

Posted on 2019-08-26 19:18:05
Jakub Badełek

Just a quick update - Fraps doesn't work with GPU acceleration turned off. I tried using it to compare FPS values on my own... I just managed to see how vast the difference is between 4K and FHD on my machine (4-6 fps vs 14-16 fps - "old" i5 4570, 1070 Ti, 16GB RAM). Maybe other software will work better? Maybe it's because the monitors I used were connected via the GPU instead of the motherboard?... I leave it up to you ;) I have always wanted to know how efficient the GPU acceleration in Lightroom is, as it has always been kind of a myth to me.

Posted on 2019-09-02 20:17:07
Joe Geske

The other issue with LR is that it honestly doesn't feel like Adobe has put together a rock solid version in like 2 years. The most recent update with supposed GPU improvements is perhaps my worst LR experience yet. Went through 3 different drivers to get my 1080 ti to register as even being capable of GPU accelerated editing, only to eventually find that performance feels more sluggish now than it did prior to the update. I honestly can't fathom how they manage to make things progressively worse update after update.

Also, personally I think more weight needs to be placed on bulk tasks like import/export etc. As a highish-volume wedding photographer, I spend way more time batch editing and building smart previews and such than I do working with brushes. My 7820X has been slowly showing performance issues with recent Windows updates and LR updates. Excited to see how the 3950X with its 16 cores will handle export tasks.

Posted on 2019-08-25 20:23:44
bill bane

Matt,

Perhaps you are familiar with Luminar and Topaz. I use them in addition to LR and Photoshop, with many round trips. I am very confident there will be more and more folks using them along with LR and Photoshop, especially the many users who are not happy with Adobe and their pricing!

I have a 4770K, 32GB, Samsung 2GB SSD system, and such round-tripping brings my system (and Windows) to its knees. Sometimes it crashes.

The next system tradeoff is more memory (64, 128) vs maxed single-core speed vs more cores. It is impossible for me to guesstimate which lever is critical, but I am certain that each involves lots of perhaps unnecessary $$ being wasted!! Maybe even PCIe gen4 vs gen3 will have an effect since my files end up huge, with 2GB not uncommon, so I do not know if I should consider Intel or only AMD.

Insight into these tradeoffs would be beyond wonderful.

A script for this round-tripping should be possible via Bridge. Luminar and Topaz are filters, as is Lightroom's exact (development) equivalent, "Camera Raw", so all of them probably could be opened sequentially (and perhaps nothing else!!), maybe with some other standard Photoshop actions.

If you just try this manually, I think you will be shocked by the time required. These would be interesting tests because, in my view, too many of the standard post-processing benchmarks are tiny, quick steps that are not related to where the time sinks are to be found. The key is to benchmark round-tripping.

Thanks for what has been done, and for what I hope might be.

Posted on 2019-09-06 00:14:01
Bill Naiman

My 2 cents worth about LR:

Yes, LR brushing with range masks does cause LR to slow down more and more as the feature is used. In concept, the feature is very, very good. I like it and use it. However, IMO, there are internal LR implementation issues that need to be addressed by Adobe. For example, Windows Task Manager shows LR using more and more memory during an editing session, commensurate with the declining responsiveness. Exiting and reloading LR shows a smaller memory allocation and better performance, only to repeat the cycle of feature use, memory growth, and performance degradation. Again, IMO, it leads me to think that LR has bugs associated with certain features that are exacerbated by larger raw files from new cameras and 4K displays.

Yes, the underlying HW is important. Yet, buggy SW can still overwhelm powerful HW.

In addition to LR, I use PS, Topaz, Luminar, On1, NIK plugins, and more. Only LR will bring my i7, 32GB RAM, 1TB SSD system to its knees.

I have implemented all of the performance recommendations Adobe lists for LR... smart previews, large cache on SSD, catalog and key files on SSD, compatible GPU, and so on.

Lastly, IMO, it feels to me that some of the features Adobe has added to LR were bolt-on tweaks (smart brushing, smart gradients, textures) to keep pace with features offered by competitive products. Based on my real-world SW experience over the years, I have seen well-meaning tweaks have significant unintended consequences. I really wish Adobe would address some of the performance issues frequently discussed in multiple forums as the primary focus of a major update. The August 2019 release was a small step in the right direction, but the needed leap still remains.

Posted on 2019-09-12 14:22:58

So, the most obvious way to reduce the failure chance of my PC would be to buy the most reliable mainboard. Do you have any statistics about different mainboard brands/models/chipsets and their failure rates?

Posted on 2019-08-26 10:30:28

Unfortunately, not really. We have a number of brands that we know are great (Gigabyte being at the top right now, but MSI and ASUS are also great), but a lot of the reason we carry one brand over another is because of the level of support we can get from them. It seems like no matter the brand we've used in the past, there are always problems that come up, especially right after a chipset is launched. The brand that is currently giving us the best support to fix those issues tends to be the one we end up offering to our customers.

That means that we don't really have great stats for reliability for one brand vs another. A few suggestions based on what we've seen, however:

1) Best brands at the moment I would say are Gigabyte, followed by MSI, ASUS, and probably Supermicro (although we don't use much of their stuff)
2) Try to minimize the number of extra features on the board. We often have to carry the boards with everything since we want to carry as few motherboards as possible (since more boards means a lot more potential individual issues), but if you don't need something like Thunderbolt, don't get a board with it.
3) Contrary to #2, avoid the low-end chipsets. For example, an H370 chipset may be all you need, but the higher-end "Z" chipsets tend to get higher-quality components. So even if you don't need things like overclocking support, the higher-end chipsets tend to be more reliable and stable overall.

Posted on 2019-08-26 19:10:40
Max Rockbin

2 questions: In your stats, is a failure necessarily a catastrophic failure - a dead part? Or would any malfunction count as a failure? I have a Samsung Pro SSD that had a read failure and had to map out a bad address. Is that a failure? An ASUS motherboard had the USB-C port die. Is that a failure for the purposes of your stats? ALSO, it would be interesting to know the failure rate of your overall systems (e.g. failure due to any part dying) over time. I could add up the numbers in your final graph, but I'm guessing there are cases with more than one part failing in the same machine. It LOOKS like you have something like a 15% system failure rate over 3 years. WAY higher than I'd have guessed. Thank you!

Posted on 2019-09-12 16:06:41

In terms of shop/DOA failures, it is pretty much anything that prevents us from selling the part to a customer. Most often that is things like stability or performance issues, but it 100% includes things like dead ports and physical defects (especially when it comes to the chassis). Unless everything is working as it is supposed to, we are not shipping it to a customer, which means it gets RMA'd. Everything you listed is definitely something we would fail a part for, no question at all.

Field failure is a bit looser since it really depends on the customer - whatever they deem important enough to either do a part swap or send the system in for repair gets counted. Things like stability/performance issues are almost always counted, but there are some customers who opt to simply live with a bit of electrical noise, a single dead port, etc. rather than having their system down while it gets fixed. Right now, we only track failures of things we actually have to replace, so there definitely will be a few that don't get counted even though they wouldn't meet our initial quality standard.

As for an overall failure rate, from what I understand it isn't as easy as just adding the numbers together, especially since you could have 4-8 sticks of RAM, multiple GPUs, etc. I went through some of our other reports, and within the time period we examined in this post we have had about a 9% repair ticket rate per sale. Of those, I would guess that about 75% are due to actual part failures, the rest being things like upgrades, OS re-installs, or other non-hardware problems. So over the last three years, call it about a 7% chance of a system in the field having a part fail that was significant enough for it to be replaced.

But... multiple parts can fail, and if we include the initial DOA failures we catch in the shop, it actually works out to be about a 17.4% chance that a system has a part (or multiple parts) fail over the last 3 years. A good chunk of those are DOA failures that never make it to the customer, but if you want an overall number, that is what our stats say. Considering there are people who live with partially failed parts (like a dead port), I don't think it is stretching it too much to say that over the entire life of a system, there is a 1 in 5 chance that a part will fail.

Posted on 2019-09-12 17:27:06
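For reference, a quick back-of-the-envelope sketch of the arithmetic in that last reply; the 17.4% combined figure is quoted from the reply above, not derived by this calculation.

```python
repair_ticket_rate = 0.09   # ~9% of systems sold open a repair ticket within ~3 years
hardware_share = 0.75       # roughly 75% of those tickets are actual part failures

field_failure_chance = repair_ticket_rate * hardware_share
print(f"Field part-failure chance over ~3 years: ~{field_failure_chance:.1%}")  # ~6.8%, i.e. ~7%

shop_and_field = 0.174      # quoted figure once shop/DOA failures are included
print(f"Chance of any part failing (shop or field): ~{shop_and_field:.1%}")
```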