Working around TDR in Windows for a better GPU computing experience

There is a funny side effect of using video cards / GPUs for computing on Windows. Moderately demanding work runs fine, but code that fully occupies the GPU can make the graphical user interface unresponsive, or at least very slow to respond. This can happen in Linux as well, but Windows has a feature called Timeout Detection and Recovery (TDR), built into the Windows Display Driver Model (WDDM), that watches for exactly that situation and resets the graphics driver when it occurs. It is designed to keep the system from hanging when a glitch in a program leads to excessive graphics card usage – and normally that's a good thing, as it can help prevent Windows from freezing up. However, when you intentionally want the video card to do demanding work that lasts more than a second or two, this feature can be a big problem.

A quick example of this – and the one I used to observe TDR in action – is nbody, a small GPU-accelerated CUDA sample program with a benchmark mode that runs for only a brief moment. That run is short enough not to trip TDR, and the alternate mode with a visual display of the simulation doesn't push the GPU hard enough to trip it either. If you use the benchmark mode with additional parameters that lengthen the test – specifically, increasing the number of bodies in the simulation – then it will trigger TDR, causing both an error in the CUDA code and a message on the Windows desktop stating that the driver has been reset.
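For example, using the flags the CUDA samples' nbody accepts (the body count below is just an illustrative value, chosen to be large enough to push the run past the default 2-second timeout on many cards):

```bat
:: Short benchmark run -- finishes quickly, does not trip TDR
nbody.exe -benchmark

:: Much larger simulation -- can run long enough to exceed the TDR timeout
nbody.exe -benchmark -numbodies=1024000
```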

This benchmark is really not a terribly important bit of software, but there are applications where this type of code could be important – and moreover, there are many other GPU programs that could be written to harness the full capabilities of an NVIDIA graphics card… so does that make Windows a platform that is unacceptable for such work?

At first it would seem so, but it turns out that you can alter TDR’s behavior to work around this problem! There are two primary options in this regard:

1) Adjusting the length of time before TDR kicks in and kills the driver. The default is 2 seconds, but if you know that you need more time it can be increased.

2) Turning off TDR entirely.

Both of these are accessible via registry entries, which Microsoft conveniently covers in its documentation. Here is a summary of the important parts, as they pertain to this discussion:

TdrLevel – Specifies the initial level of recovery. The default is to recover on timeout, which is represented by value 3.
KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrLevel
ValueType : REG_DWORD
ValueData : 0 – Detection disabled OR 3 – Recover on timeout

TdrDelay – Specifies the number of seconds that the GPU can delay the preempt request from the GPU scheduler. This is effectively the timeout threshold. The default value is 2 seconds.
KeyPath   : HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
KeyValue  : TdrDelay
ValueType : REG_DWORD
ValueData : Number of seconds to delay; 2 seconds is the default

Neither of these values exists in the registry by default, so they have to be added manually. If you add them set to match the defaults listed above, the system behaves the same as if they did not exist.
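For reference, a .reg file creating both values, set to the defaults described above (the dword values are hexadecimal; standard Registry Editor export syntax), might look like this:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers]
"TdrLevel"=dword:00000003
"TdrDelay"=dword:00000002
```

Note that changes to these values generally require a reboot before they take effect.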

I found that changing either of these worked for allowing a longer benchmark run of nbody. If you know that the code you want to run shouldn't take longer than a certain amount of time, the first option – raising the timeout – seems preferable; but if you don't know how long it could take, and don't want your code interrupted, simply turning TDR off is certainly viable. Just remember that these settings won't only affect the CUDA code you run: they would also allow a legitimate graphics software glitch to potentially cause the system to hang.
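As a sketch of how a TDR reset typically surfaces in CUDA code: the kernel is killed when the driver resets, and the next synchronizing runtime call reports cudaErrorLaunchTimeout ("the launch timed out and was terminated"). The busy-loop kernel here is a hypothetical stand-in for any long-running kernel, not part of nbody:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that spins for a given number of GPU clock cycles --
// long enough to exceed the default 2-second TDR delay on most cards.
__global__ void busyKernel(unsigned long long cycles)
{
    unsigned long long start = clock64();
    while (clock64() - start < cycles) { /* spin */ }
}

int main()
{
    // Roughly 10+ seconds of spinning on typical GPU clock rates.
    busyKernel<<<1, 1>>>(20000000000ULL);

    // If TDR fires mid-kernel, the driver is reset and this call returns
    // cudaErrorLaunchTimeout instead of cudaSuccess.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}
```

Checking the error from every synchronizing call is what lets a compute application at least fail cleanly, rather than silently producing garbage after a driver reset.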

In addition to the registry keys above, I tried a few other things that are worth noting:

1) I tried running the code on a normal GeForce card (a 980 Ti, specifically), a Titan X, and a Quadro (M4000). I wanted to see whether the more professional-grade cards would behave differently, but the Titan X and the Quadro both exhibited the same TDR behavior.

2) I wondered if TDR might only fire when the card becoming unresponsive was the primary one, driving the actual GUI / display. So I put both GeForce cards in (980 Ti and Titan X) and ran the benchmark on just the secondary card… but it still tripped TDR.

3) Lastly, I tried putting a Tesla K40 card in as a secondary – this time alongside the Quadro M4000. Running the benchmark on the Tesla did not trip TDR! This is most likely because Tesla cards run under NVIDIA's TCC (Tesla Compute Cluster) driver rather than WDDM, so the display driver's watchdog never sees them. That gives one other method to work around TDR, then, though for many developers a GeForce card will be a much more affordable option.