NVIDIA Titan GPUs (3 generations) – CUDA 8 rc performance on Ubuntu 16.04

We’ve got a quick test of CUDA performance on three generations of NVIDIA’s Titan GPUs for you. NVIDIA released the Pascal GeForce Titan X much earlier than people were expecting (including NVIDIA :-). Just like the older Titan cards, the new Pascal-based card did not disappoint! For comparison we have a GTX 1080 in the mix too.

This is just a brief look at performance using the nbody code from the CUDA samples and a molecular dynamics simulation (stmv) using NAMD.

The short story: the Titan X (Pascal) is an amazing video card and CUDA compute performance is stunning!

The setup for this testing is Ubuntu 16.04 with CUDA 8.0 RC (as of this writing you need to be an NVIDIA registered developer to access the release candidate). The NVIDIA display driver is 367.35 from the “Graphics-Drivers ppa”.

Base systems

The Peak Tower Single
CPU: Intel Core-i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
Memory: 64 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3
The Peak Tower Quad
CPU: (4) Intel Xeon E7 8867v3 16-core @ 2.5GHz (2.7GHz All-Core-Turbo)
Memory: 512 GB DDR4 2133MHz Reg ECC
PCIe: (4) X16-X16 v3

Note: The Quad was used for NAMD for two reasons: 1) I had it on the bench, and 2) the GPUs are so fast that I wanted lots of CPU performance so we could see the difference between them! (I’m also using some older v3 CPUs here; we use E5/E7 v4 in our current Quad builds.)

Video cards used for testing (data from nvidia-smi)

| Card | CUDA cores | GPU clock (MHz) | Memory clock (MHz)* | Application clock (MHz)** | FB memory (MiB) |
|---|---|---|---|---|---|
| Titan X (Pascal) | 3584 | 1911 | 5005 | 1417 | 12186 |
| TITAN X (Maxwell) | 3072 | 1392 | 3505 | 1000 | 12206 |
| Titan Black | 2880 | 1202 | 3500 | 888 | 6082 |
| GTX 1080 | 2560 | 1911 | 5005 | 1607 | 8113 |
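
As a rough yardstick for the figures in this table, peak single-precision throughput is commonly estimated as 2 FLOPs (one fused multiply-add) per CUDA core per clock. The sketch below applies that rule of thumb to the core counts and GPU clocks listed above. The function name is just illustrative, and since sustained clocks under load are usually lower than the maximum reported by nvidia-smi, the results should be read as upper bounds rather than expected performance.

```python
# Rough peak FP32 estimate: 2 FLOPs (one FMA) per CUDA core per clock cycle.
# Core counts and GPU clocks (MHz) are copied from the table above.
CARDS = {
    "Titan X (Pascal)": (3584, 1911),
    "TITAN X (Maxwell)": (3072, 1392),
    "Titan Black": (2880, 1202),
    "GTX 1080": (2560, 1911),
}


def peak_gflops(cores: int, clock_mhz: float) -> float:
    """Upper-bound single-precision GFLOP/s at the given clock."""
    return 2 * cores * clock_mhz / 1000.0


for name, (cores, clock) in CARDS.items():
    print(f"{name:18s} {peak_gflops(cores, clock):8.0f} GFLOP/s (upper bound)")
```

Swapping in the application clock column instead of the GPU clock gives a more conservative estimate.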

Results

| Card | nbody single precision (GFLOP/s) | NAMD run time (sec) | NAMD (day/ns) |
|---|---|---|---|
| Titan X (Pascal) | 7507 | 41 | 0.570 |
| TITAN X (Maxwell) | 4292 | 55 | 0.889 |
| Titan Black | 2302 | 81 | 1.460 |
| GTX 1080 | 5429 | 48 | 0.709 |
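
To make the generational gains easier to read, here is a small sketch that turns the results above into speedups relative to a chosen baseline card and converts NAMD's day/ns figure into the more familiar ns/day. The numbers are copied from the table; the function and key names are only illustrative.

```python
# Results copied from the table above:
# (nbody GFLOP/s, NAMD run time in seconds, NAMD day/ns).
RESULTS = {
    "Titan X (Pascal)": (7507, 41, 0.570),
    "TITAN X (Maxwell)": (4292, 55, 0.889),
    "Titan Black": (2302, 81, 1.460),
    "GTX 1080": (5429, 48, 0.709),
}


def relative_to(baseline: str) -> dict:
    """Return nbody speedup, NAMD speedup, and ns/day for every card."""
    base_gflops, _, base_days_per_ns = RESULTS[baseline]
    summary = {}
    for name, (gflops, _, days_per_ns) in RESULTS.items():
        summary[name] = {
            "nbody_speedup": gflops / base_gflops,
            # Lower day/ns is better, so the ratio is inverted.
            "namd_speedup": base_days_per_ns / days_per_ns,
            "ns_per_day": 1.0 / days_per_ns,
        }
    return summary


for name, row in relative_to("Titan Black").items():
    print(f"{name:18s} nbody x{row['nbody_speedup']:.2f}  "
          f"NAMD x{row['namd_speedup']:.2f}  {row['ns_per_day']:.2f} ns/day")
```

With the Titan Black as baseline, the NAMD speedups come out smaller than the nbody speedups, which would be consistent with part of the NAMD work still running on the CPU.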

Notes:

nbody -benchmark -numbodies=256000 -device=N (where N is one of 0, 1, 2, 3)
namd2 +p 64 +setcpuaffinity stmv.namd
This is more CPU cores than needed for balance with the GPU but I wanted the GPU to be performance limiting.
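
If you want to repeat the nbody run on every card in turn, something like the following sketch would do it. The flags are the ones listed in the notes above; the `./nbody` path, the function names, and the dry-run default are assumptions for illustration, so adjust them to wherever your CUDA samples were built.

```python
import shlex
import subprocess


def nbody_command(device: int, numbodies: int = 256000,
                  binary: str = "./nbody") -> list:
    """Build the argument list for one benchmark run on the given device.

    The binary path is an assumption; point it at your built sample.
    """
    return [binary, "-benchmark", f"-numbodies={numbodies}", f"-device={device}"]


def run_all(devices=(0, 1, 2, 3), dry_run: bool = True) -> list:
    """Print each command; execute it only when dry_run is False."""
    commands = [nbody_command(d) for d in devices]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all()  # dry run: only prints the four command lines
```

Calling `run_all(dry_run=False)` from the samples' release directory would actually launch the four runs one after another.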

I’m looking forward to setting up NVIDIA DIGITS 4 (just released to developers) and seeing what kind of performance we get with training sets on Caffe using the new Titan X.

Extra data … if you like that sort of thing

kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./bandwidthTest 0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: TITAN X
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			11853.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12854.1

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			342979.6

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 12186 MBytes (12778274816 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1531 MHz (1.53 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     - Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) -

Device 1: "GeForce GTX TITAN Black"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6082 MBytes (6377439232 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            980 MHz (0.98 GHz)
  Memory Clock rate:                             3500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 9 / 0
  Compute Mode:
     - Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) -

Device 2: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12207 MBytes (12799574016 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 6 / 0
  Compute Mode:
     - Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) -

Device 3: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8113 MBytes (8507555840 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1734 MHz (1.73 GHz)
  Memory Clock rate:                             5005 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 10 / 0
  Compute Mode:
     - Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) -
> Peer access from TITAN X (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from TITAN X (GPU0) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from TITAN X (GPU0) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN X (GPU2) : No
> Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX 1080 (GPU3) : No
> Peer access from GeForce GTX 1080 (GPU3) -> TITAN X (GPU0) : No
> Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN Black (GPU1) : No
> Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN X (GPU2) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X, Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN X, Device3 = GeForce GTX 1080
Result = PASS
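
For context on the bandwidthTest numbers, the deviceQuery fields are enough to estimate theoretical device memory bandwidth using the conventional formula: 2 transfers per clock times the reported memory clock times the bus width in bytes. The sketch below does that for Device 0 and compares it with the device-to-device figure measured earlier. It assumes the MB/s reported by bandwidthTest means 10^6 bytes per second, which may not be exactly how the sample defines it, so take the percentage as approximate.

```python
def peak_mem_bandwidth_gbs(mem_clock_mhz: float, bus_width_bits: int) -> float:
    """Theoretical device memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Conventional estimate: 2 transfers per clock * clock * bus width in bytes.
    """
    return 2 * mem_clock_mhz * 1e6 * (bus_width_bits / 8) / 1e9


# Device 0 ("TITAN X", Pascal) per deviceQuery: 5005 MHz memory clock, 384-bit bus.
peak = peak_mem_bandwidth_gbs(5005, 384)

# Device-to-device result from bandwidthTest, assuming MB/s = 1e6 bytes/s.
measured = 342979.6 / 1000.0

print(f"theoretical {peak:.1f} GB/s, measured {measured:.1f} GB/s, "
      f"{100 * measured / peak:.0f}% of peak")
```

The same function applied to the GTX 1080's 256-bit bus shows why its memory bandwidth is lower despite the identical memory clock.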

Happy computing –dbk