The thin line between stress testing and hardware abuse

Over the last month, we have been streamlining and updating our benchmarking and testing process to make it both more efficient and much more effective. As part of this process, the topic of stress testing has come up several times. Stress testing is something we actively perform on all our systems because we want to catch any bad hardware in-house, long before the system reaches the customer. In addition, if a component is mostly fine but has a slight flaw that will cause it to fail in a few months, we want to trigger that failure during the production process so we can fix it with minimal impact on the end user.

Currently, we perform this stress testing with a number of applications, including Linpack, Prime95, and Furmark. So far, this method has been very effective. Over the last year, 70% of all video card failures and nearly 80% of all CPU failures were caught in-house long before the system reached the customer. The end result is significantly fewer customers who need to deal with a hardware failure (and the support/repair process that goes along with it), which is a huge win in our books.

However, there are a number of more useful projects we could participate in to stress test our systems, such as Folding@Home, SETI@Home, or even cryptocurrency mining (donating the proceeds to charity). Each of these not only stresses the hardware in the system, but does so in a productive and useful manner. The idea we have kicked around is that anytime a machine is idle for more than a short period of time – which is often the case when in queue for quality control or shipping – we would have it set to run a combination of these useful applications to put an extended heavy load on the system.
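To make the idea a bit more concrete, here is a minimal sketch of how an idle trigger could work, written in Python using the psutil library. The CPU threshold, idle window, and the WORKLOAD_CMD placeholder are illustrative assumptions rather than part of our actual process; in practice the command would launch something like the Folding@Home client running under its own account.

import subprocess
import time

import psutil  # third-party library: pip install psutil

# Placeholder for the real command -- e.g. a Folding@Home or mining client.
WORKLOAD_CMD = ["echo", "launch compute workload here"]

IDLE_CPU_THRESHOLD = 10.0        # percent; below this the machine counts as idle
IDLE_SECONDS_REQUIRED = 15 * 60  # stay idle this long before starting the workload
CHECK_INTERVAL = 60              # seconds between samples

def wait_for_idle():
    # Block until the machine has been continuously idle long enough.
    idle_seconds = 0
    while idle_seconds < IDLE_SECONDS_REQUIRED:
        cpu = psutil.cpu_percent(interval=1)  # average utilization over a 1-second sample
        if cpu < IDLE_CPU_THRESHOLD:
            idle_seconds += CHECK_INTERVAL
        else:
            idle_seconds = 0  # activity resumed; restart the countdown
        time.sleep(CHECK_INTERVAL - 1)

def main():
    wait_for_idle()
    # Launch the workload; stopping it again (for example, when a technician
    # pulls the machine out of the queue) would need a separate trigger.
    subprocess.Popen(WORKLOAD_CMD).wait()

if __name__ == "__main__":
    main()

A real version would also need to watch GPU load (to cover Furmark-style and Folding@Home GPU work) and shut the workload down cleanly before the machine is imaged and shipped, but the basic loop stays the same.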

The concern we have had is whether it would actually be too effective a stress test. Is there a line where hardware testing becomes hardware abuse? Yes, we want to cause hardware to fail if it is close to doing so, but stress testing for an excessive amount of time could potentially shorten the lifespan of the system. This is a tricky balance to find. We feel very good about the balance we have right now because we have long-term hard data on both in-house and in-field failure rates for our PCs. On the one hand, "if it isn't broken, don't fix it"; on the other hand, we might do even better with a change to our processes. What do you think? This is something we are very interested in hearing feedback on in the comments.