Dr. Donald Kinghorn (HPC and Scientific Computing)

GTX 980 Ti Linux CUDA performance vs Titan X and GTX 980

Written on June 12, 2015 by Dr. Donald Kinghorn

The latest addition to the NVIDIA GeForce line, the GTX 980 Ti, is a significant improvement over the GTX 980 for single precision compute tasks using CUDA, and it comes close to the astounding performance of the Titan X. The most significant difference is memory: the GTX 980 Ti’s 6GB is half of the Titan X’s 12GB (but 2GB more than the 980’s 4GB). For computational work the extra memory of the Titan X may be important to you. It can make life easier when you are working on new code and haven’t optimized the memory buffering to keep the GPU loaded. In general, more memory is a good thing, and the Titan X is so fast for single precision calculations that the extra memory can help keep large workloads flowing. However, many CUDA accelerated programs have excellent memory buffering from host to card, and 6GB will be enough to keep those CUDA cores loaded.

I’ll be adding the GTX 980 Ti to my GPU computing test results from here on out. I should have a post up on some molecular modeling applications soon. I’ve been working with GROMACS lately and I suspect that the 980 Ti may be the card of choice for it. For now I just want to show a simple CUDA benchmark on Linux so you can see what to expect for serious compute performance from the 980 Ti.

Getting the GTX 980 Ti to work on Linux

For some reason the 980 Ti was more of a problem to get working under Linux than the Titan X when it was first released. The Titan X “just worked” under Linux even before there was a functioning Windows driver! The 980 Ti required the latest NVIDIA beta driver!

System setup:

Puget Systems Peak Tower
  • Dual 10 core Xeon 2687W v3 @ 3.1 GHz
  • 64GB DDR4-2133 Reg ECC
  • NVIDIA Titan X 12GB X16 1000MHz core clock 3072 CUDA cores
  • NVIDIA GTX 980 Ti 6GB X16 1000MHz core clock 2816 CUDA cores
  • NVIDIA GTX 980 4GB X16 1126MHz core clock 2048 CUDA cores
  • Linux CentOS 7
  • NVIDIA kernel driver modules version 352.09 (beta)

Make careful note of the NVIDIA driver version! I had to use the most recent beta driver to get the GTX 980 Ti to work. That means I had to use the downloaded shell archive .run file, so the driver install is outside of the normal package management and all of the usual difficulties with that situation apply. I generally prefer to have the NVIDIA driver installed from the “package managed” cuda repo.

The driver I used was:

http://us.download.nvidia.com/XFree86/Linux-x86_64/352.09/NVIDIA-Linux-x86_64-352.09.run

I suggest that you look at my post “Install NVIDIA CUDA on Fedora 22 with gcc 5.1” for an approach you might want to consider for getting CUDA and the beta driver installed. [ I do a CUDA install from the NVIDIA repo that fails at the driver install (but sets up everything for it and CUDA) and then install the NVIDIA shell archive .run file -- crazy but simple and effective! ]

The NVIDIA display driver from the current cuda 7.0-28 repo does not support the GTX 980 Ti. I had cuda 7 installed on the system already when I added the 980 Ti to the mix. It is device 1 in the following nvidia-smi output.

[kinghorn@tower Downloads]$ nvidia-smi 
Mon Jun  1 15:31:35 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
| 26%   37C    P8    N/A /  N/A |    116MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  ERR!                Off  | 0000:04:00.0     N/A |                  N/A |
| 22%   36C    P8    N/A /  N/A |     20MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Graphics Device     Off  | 0000:81:00.0     Off |                  N/A |
| 22%   31C    P8    15W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Note: Device 2 is the Titan X which shows up as “Graphics Device” with the 346.46 display driver. Device 1 with the "ERR!" is the GTX 980 Ti.

After the 352.09 beta driver install we have:

[kinghorn@tower ~]$ nvidia-smi 
Mon Jun  1 15:44:00 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.09     Driver Version: 352.09         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
| 27%   46C    P2    N/A /  N/A |     81MiB /  4095MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  Graphics Device     Off  | 0000:04:00.0     N/A |                  N/A |
| 22%   45C    P8    N/A /  N/A |     20MiB /  6143MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 0000:81:00.0     Off |                  N/A |
| 22%   39C    P8    16W / 250W |     23MiB / 12287MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Yea! The GTX 980 Ti, device 1, is working and now has the distinguished title "Graphics Device", and the Titan X gets its proper name.
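A check like the one in the two nvidia-smi listings above is easy to script. The following is only a minimal sketch, not something used for this post: it parses the nvidia-smi text as shown here and flags any card whose name column reads ERR!. The function names and the MIN_DRIVER threshold are assumptions made for illustration; 352.09 is simply the version that worked on this system, not a documented minimum.

```python
import re

# Assumption: 352.09 is the driver that recognized the GTX 980 Ti in this post,
# used here as an illustrative threshold rather than an official minimum.
MIN_DRIVER = (352, 9)


def driver_version(smi_text):
    """Return the 'Driver Version' field of nvidia-smi output as a tuple of ints."""
    match = re.search(r"Driver Version:\s*(\d+)\.(\d+)", smi_text)
    if match is None:
        raise ValueError("no 'Driver Version' field found")
    return tuple(int(part) for part in match.groups())


def unrecognized_gpus(smi_text):
    """Return the indices of GPUs whose name column shows ERR!."""
    pattern = re.compile(r"^\|\s+(\d+)\s+ERR!", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(smi_text)]


# Lines copied from the first nvidia-smi listing above (346.46 driver).
before = """\
| NVIDIA-SMI 346.46     Driver Version: 346.46         |
|   0  GeForce GTX 980     Off  | 0000:03:00.0     N/A |                  N/A |
|   1  ERR!                Off  | 0000:04:00.0     N/A |                  N/A |
|   2  Graphics Device     Off  | 0000:81:00.0     Off |                  N/A |
"""

print(driver_version(before) >= MIN_DRIVER)  # is the driver new enough?
print(unrecognized_gpus(before))             # which devices show ERR!?
```

With the 346.46 listing this reports that the driver is too old and that device 1 is the unrecognized card, which matches what the listing shows.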

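For a quick test of the CUDA single precision floating point performance, here are the results from running benchmark mode with the nbody simulation from the CUDA samples (256000 bodies, rounded from the detailed output further down):

  • Titan X: 3642 GFLOP/s
  • GTX 980 Ti: 3355 GFLOP/s
  • GTX 980: 2555 GFLOP/s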

The GTX 980 Ti is going to be a great card for single precision compute loads!
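The GFLOP/s figures nbody reports can be reproduced from its timings, which makes it easier to see how the cards compare. This is a small sketch of that arithmetic: the timings are copied from the detailed nbody output in this post, and the 20 flops per interaction is the value nbody itself states. The function and variable names are mine, chosen for illustration.

```python
FLOPS_PER_INTERACTION = 20  # the value nbody states in its own output
NUM_BODIES = 256000
ITERATIONS = 10

# Total time in ms for 10 iterations, copied from the nbody output in this post.
timings_ms = {
    "Titan X": 3598.562,
    "GTX 980 Ti": 3906.858,
    "GTX 980": 5129.597,
}


def nbody_gflops(num_bodies, iterations, total_ms):
    """Single-precision GFLOP/s for an all-pairs n-body run."""
    interactions = num_bodies ** 2 * iterations  # all pairs, every iteration
    interactions_per_second = interactions / (total_ms / 1000.0)
    return interactions_per_second * FLOPS_PER_INTERACTION / 1e9


gflops = {gpu: nbody_gflops(NUM_BODIES, ITERATIONS, ms)
          for gpu, ms in timings_ms.items()}

for gpu, value in gflops.items():
    print(f"{gpu:<11} {value:8.1f} GFLOP/s")

print(f"980 Ti vs Titan X: {gflops['GTX 980 Ti'] / gflops['Titan X']:.2f}")
print(f"980 Ti vs GTX 980: {gflops['GTX 980 Ti'] / gflops['GTX 980']:.2f}")
```

On this one benchmark the 980 Ti reaches roughly 92% of the Titan X's throughput and about 1.31 times that of the GTX 980.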

Following is some more detailed output from the nbody simulation and deviceQuery output for your enjoyment :-)

[kinghorn@tower release]$ nvidia-smi -L
GPU 0: GeForce GTX 980 (UUID: GPU-477f0fd5-9db5-a015-e8b3-15ac96a06920)
GPU 1: Graphics Device (UUID: GPU-42f87dff-242a-17b2-7faf-2b4e18aec0d8)
GPU 2: GeForce GTX TITAN X (UUID: GPU-f195b1fa-16ec-ea58-8a17-146c0f93930e)
[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=0
...
gpuDeviceInit() CUDA Device [0]: "GeForce GTX TITAN X
> Compute 5.2 CUDA device: [GeForce GTX TITAN X]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3598.562 ms
= 182.117 billion interactions per second
= 3642.344 single-precision GFLOP/s at 20 flops per interaction

[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=1
...
gpuDeviceInit() CUDA Device [1]: "Graphics Device
> Compute 5.2 CUDA device: [Graphics Device]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3906.858 ms
= 167.746 billion interactions per second
= 3354.921 single-precision GFLOP/s at 20 flops per interaction

[kinghorn@tower release]$ ./nbody -benchmark -numbodies=256000 -device=2
...
gpuDeviceInit() CUDA Device [2]: "GeForce GTX 980
> Compute 5.2 CUDA device: [GeForce GTX 980]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 5129.597 ms
= 127.761 billion interactions per second
= 2555.210 single-precision GFLOP/s at 20 flops per interaction
[kinghorn@tower release]$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 3 CUDA Capable device(s)

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12288 MBytes (12884705280 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

Device 1: "Graphics Device"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
  (22) Multiprocessors, (128) CUDA Cores/MP:     2816 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 

Device 2: "GeForce GTX 980"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
  GPU Max Clock rate:                            1216 MHz (1.22 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
      Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) 
> Peer access from GeForce GTX TITAN X (GPU0) -> Graphics Device (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU0) -> GeForce GTX 980 (GPU2) : No
> Peer access from Graphics Device (GPU1) -> Graphics Device (GPU1) : No
> Peer access from Graphics Device (GPU1) -> GeForce GTX 980 (GPU2) : No
> Peer access from Graphics Device (GPU1) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from Graphics Device (GPU1) -> Graphics Device (GPU1) : No
> Peer access from GeForce GTX 980 (GPU2) -> GeForce GTX TITAN X (GPU0) : No
> Peer access from GeForce GTX 980 (GPU2) -> Graphics Device (GPU1) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 3, Device0 = GeForce GTX TITAN X, Device1 = Graphics Device, Device2 = GeForce GTX 980
Result = PASS
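
Note that the CUDA device numbers used by nbody and deviceQuery are not the same as the GPU numbers nvidia-smi prints: CUDA device 0 is the Titan X, while nvidia-smi lists the GTX 980 as GPU 0. The PCI bus ID ties the two together. Here is a minimal sketch of that lookup using the bus IDs from the output above; the dictionary and function names are illustrative. The CUDA documentation also describes a CUDA_DEVICE_ORDER environment variable (value PCI_BUS_ID) for making the runtime enumerate in bus order, but I have not verified it on this system.

```python
# nvidia-smi index -> Bus-Id column (the middle field is the bus number in hex).
smi_bus_ids = {0: "0000:03:00.0", 1: "0000:04:00.0", 2: "0000:81:00.0"}

# CUDA device index -> decimal "Bus ID" from the deviceQuery output.
cuda_bus_numbers = {0: 129, 1: 4, 2: 3}


def bus_number(bus_id):
    """Extract the bus number from a domain:bus:device.function string."""
    return int(bus_id.split(":")[1], 16)


# Index the nvidia-smi entries by bus number, then look up each CUDA device.
smi_index_by_bus = {bus_number(bus_id): index
                    for index, bus_id in smi_bus_ids.items()}
cuda_to_smi = {cuda: smi_index_by_bus[bus]
               for cuda, bus in cuda_bus_numbers.items()}

for cuda, smi in cuda_to_smi.items():
    print(f"CUDA device {cuda} is nvidia-smi GPU {smi}")
```

For this system the lookup pairs CUDA device 0 with nvidia-smi GPU 2 (the Titan X on bus 0x81, which is 129 in decimal), leaves the 980 Ti as device 1 in both, and pairs CUDA device 2 with nvidia-smi GPU 0 (the GTX 980).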

Happy computing! --dbk

Tags: NVIDIA, GTX980 Ti, Titan X, GTX 980, CUDA, Linux
Dr.Madhav

This review isn't comprehensive at all. For example, on at least three counts:

1) Double Precision: I regularly have to deal with both single and double precision demands. In the latter case, those Maxwell children get decimated when compared to an 'elder statesman' like the Titan Black! :)

2) Power Efficiency: It matters a lot to people like me who invest with a multi-pronged strategy, e.g., bringing down overall company IT operational costs by deploying a solar power solution. The initial upfront investment increases, but it saves good money in the long run. From this perspective, every chosen CPU, GPU, RAM kit, etc. matters on an annual scale.

3) NVIDIA alternatives: I know that this article is about CUDA. However, I must add that I chose a computational ecosystem based on both NVIDIA and AMD pro cards in order to get the best of both worlds. I understand that certain skills and workflows vary to an extent, but those FirePros are excellent cards too. I'm quite happy with the performance of the W9100 and W8100 cards.

Posted on 2015-08-25 10:06:18