We’ve got a quick test of CUDA performance on three generations of NVIDIA’s Titan GPUs for you. NVIDIA released the Pascal GeForce Titan X much earlier than people were expecting (including NVIDIA :-). Just like the older Titan cards, the new Pascal-based card did not disappoint! For comparison we have a GTX 1080 in the mix too.
This is just a brief look at performance using the nbody code from the CUDA samples and a Molecular Dynamics simulation (stmv) using NAMD.
The short story is — the Titan X (Pascal) is an amazing video card and CUDA compute performance is stunning!
The setup for this testing is Ubuntu 16.04 with CUDA 8.0 RC (as of the date of this writing you need to be an NVIDIA registered developer to access the release candidate). The NVIDIA display driver is 367.35 from the “Graphics-Drivers ppa”.
Base systems
- The Peak Tower Single
- CPU: Intel Core-i7 6900K 8-core @ 3.2GHz (3.5GHz All-Core-Turbo)
- Memory: 64 GB DDR4 2133MHz Reg ECC
- PCIe: (4) X16-X16 v3
- The Peak Tower Quad
- CPU: (4) Intel Xeon E7 8867v3 16-core @ 2.5GHz (2.7GHz All-Core-Turbo)
- Memory: 512 GB DDR4 2133MHz Reg ECC
- PCIe: (4) X16-X16 v3
Note: The Quad was used for NAMD for two reasons: 1) I had it on the bench and 2) the GPUs are so fast that I wanted lots of CPU performance so we could see the difference between the GPUs! (I’m also using some older v3 CPUs; we use E5/E7 v4 in our current Quad builds.)
Video cards used for testing (data from nvidia-smi)
Card | CUDA cores | GPU clock MHz | Memory clock MHz* | Application clock MHz** | FB Memory MiB |
---|---|---|---|---|---|
Titan X (Pascal) | 3584 | 1911 | 5005 | 1417 | 12186 |
TITAN X (Maxwell) | 3072 | 1392 | 3505 | 1000 | 12206 |
Titan Black | 2880 | 1202 | 3500 | 888 | 6082 |
GTX 1080 | 2560 | 1911 | 5005 | 1607 | 8113 |
Results
Card | nbody single precision GFLOP/s | NAMD run time (sec) | NAMD day/ns |
---|---|---|---|
Titan X (Pascal) | 7507 | 41 | 0.570 |
TITAN X (Maxwell) | 4292 | 55 | 0.889 |
Titan Black | 2302 | 81 | 1.460 |
GTX 1080 | 5429 | 48 | 0.709 |
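The numbers in the results table are easier to compare as ratios. The short Python sketch below copies the table values and computes each card's speedup over the Titan Black, plus NAMD throughput in ns/day (the reciprocal of day/ns). The choice of the Titan Black as baseline and the function and variable names are mine, for illustration only.

```python
# Values copied from the results table:
# (nbody single precision GFLOP/s, NAMD run time in seconds, NAMD day/ns)
results = {
    "Titan X (Pascal)": (7507, 41, 0.570),
    "TITAN X (Maxwell)": (4292, 55, 0.889),
    "Titan Black": (2302, 81, 1.460),
    "GTX 1080": (5429, 48, 0.709),
}

BASELINE = "Titan Black"  # assumption: oldest card used as the reference


def summarize(results, baseline=BASELINE):
    """Return per-card speedups relative to the baseline and NAMD ns/day."""
    base_gflops, _, base_day_per_ns = results[baseline]
    summary = {}
    for card, (gflops, _, day_per_ns) in results.items():
        summary[card] = {
            # Higher GFLOP/s is better, so divide by the baseline.
            "nbody_speedup": gflops / base_gflops,
            # Lower day/ns is better, so the baseline goes on top.
            "namd_speedup": base_day_per_ns / day_per_ns,
            # ns/day is simply the reciprocal of day/ns.
            "ns_per_day": 1.0 / day_per_ns,
        }
    return summary


summary = summarize(results)
for card, row in summary.items():
    print(
        f"{card:18s} nbody x{row['nbody_speedup']:.2f}  "
        f"NAMD x{row['namd_speedup']:.2f}  {row['ns_per_day']:.2f} ns/day"
    )
```

By this measure the gap between cards is noticeably larger for nbody than for NAMD, which also depends on the CPU side of the system.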
Notes:
nbody -benchmark -numbodies=256000 -device= {one of (0,1,2,3)}
namd2 +p 64 +setcpuaffinity stmv.namd
That is more CPU cores than needed to balance the GPU, but I wanted the GPU to be the performance-limiting component.
I’m looking forward to setting up NVIDIA DIGITS 4 (just released to developers) and seeing what kind of performance we see with training sets on Caffe using the new Titan X.
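If you want to repeat the nbody run on each card in turn, a small wrapper saves some typing. This is only a sketch: the flags are the ones listed in the notes, but the `./nbody` path, the helper names, and the dry-run default are assumptions of mine, not part of the original test procedure.

```python
import shlex
import subprocess

NBODY = "./nbody"  # assumption: run from the CUDA samples release directory


def nbody_command(device, numbodies=256000, binary=NBODY):
    """Build the nbody benchmark command line for one GPU index."""
    return [binary, "-benchmark", f"-numbodies={numbodies}", f"-device={device}"]


def run_all(devices=(0, 1, 2, 3), dry_run=True):
    """Run (or, in dry-run mode, just format) the benchmark for each device."""
    outputs = {}
    for dev in devices:
        cmd = nbody_command(dev)
        if dry_run:
            # Only show what would be executed.
            outputs[dev] = shlex.join(cmd)
        else:
            # Capture stdout so the reported GFLOP/s can be read afterwards.
            proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
            outputs[dev] = proc.stdout
    return outputs


for dev, line in run_all().items():
    print(dev, line)
```

Call `run_all(dry_run=False)` on a machine with the samples built to actually execute the benchmark.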
Extra data … if you like that sort of thing
    kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./bandwidthTest 0
    [CUDA Bandwidth Test] - Starting...
    Running on...
     Device 0: TITAN X
     Quick Mode
     Host to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)    Bandwidth(MB/s)
       33554432                 11853.2
     Device to Host Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)    Bandwidth(MB/s)
       33554432                 12854.1
     Device to Device Bandwidth, 1 Device(s)
     PINNED Memory Transfers
       Transfer Size (Bytes)    Bandwidth(MB/s)
       33554432                 342979.6
    Result = PASS
    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
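To put the host/device numbers from the bandwidthTest output above in context, they can be compared with the theoretical peak of a PCIe 3.0 x16 link (8 GT/s per lane with 128b/130b encoding). The sketch below assumes the card sits in one of the v3 x16 slots listed for the base systems and treats the reported MB/s as decimal megabytes; both are my assumptions, and the helper name is made up for illustration.

```python
# Host<->device bandwidths copied from the bandwidthTest output (MB/s).
measured_mb_s = {
    "host_to_device": 11853.2,
    "device_to_host": 12854.1,
}


def pcie3_peak_gb_s(lanes=16):
    """Theoretical one-direction PCIe 3.0 bandwidth in GB/s.

    8 GT/s per lane, 128b/130b line encoding, 8 bits per byte.
    Protocol overhead (packet headers etc.) is not included.
    """
    return 8.0 * lanes * (128.0 / 130.0) / 8.0


peak = pcie3_peak_gb_s()
efficiency = {}
for direction, mb_s in measured_mb_s.items():
    gb_s = mb_s / 1000.0  # assumption: decimal MB, so 1000 MB = 1 GB
    efficiency[direction] = gb_s / peak
    print(f"{direction}: {gb_s:.2f} GB/s, {100 * efficiency[direction]:.0f}% of {peak:.2f} GB/s peak")
```

Under those assumptions both directions land at roughly three quarters or more of the link's theoretical rate, which is typical for pinned-memory transfers.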
    kinghorn@u16ps:~/testing/samples-8.0/bin/x86_64/linux/release$ ./deviceQuery
    ./deviceQuery Starting...
     CUDA Device Query (Runtime API) version (CUDART static linking)
    Detected 4 CUDA Capable device(s)
    Device 0: "TITAN X"
      CUDA Driver Version / Runtime Version 8.0 / 8.0
      CUDA Capability Major/Minor version number: 6.1
      Total amount of global memory: 12186 MBytes (12778274816 bytes)
      (28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
      GPU Max Clock rate: 1531 MHz (1.53 GHz)
      Memory Clock rate: 5005 Mhz
      Memory Bus Width: 384-bit
      L2 Cache Size: 3145728 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 65536
      Warp size: 32
      Maximum number of threads per multiprocessor: 2048
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 2 copy engine(s)
      Run time limit on kernels: Yes
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 5 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    Device 1: "GeForce GTX TITAN Black"
      CUDA Driver Version / Runtime Version 8.0 / 8.0
      CUDA Capability Major/Minor version number: 3.5
      Total amount of global memory: 6082 MBytes (6377439232 bytes)
      (15) Multiprocessors, (192) CUDA Cores/MP: 2880 CUDA Cores
      GPU Max Clock rate: 980 MHz (0.98 GHz)
      Memory Clock rate: 3500 Mhz
      Memory Bus Width: 384-bit
      L2 Cache Size: 1572864 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 65536
      Warp size: 32
      Maximum number of threads per multiprocessor: 2048
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 1 copy engine(s)
      Run time limit on kernels: No
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 9 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    Device 2: "GeForce GTX TITAN X"
      CUDA Driver Version / Runtime Version 8.0 / 8.0
      CUDA Capability Major/Minor version number: 5.2
      Total amount of global memory: 12207 MBytes (12799574016 bytes)
      (24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
      GPU Max Clock rate: 1076 MHz (1.08 GHz)
      Memory Clock rate: 3505 Mhz
      Memory Bus Width: 384-bit
      L2 Cache Size: 3145728 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 65536
      Warp size: 32
      Maximum number of threads per multiprocessor: 2048
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 2 copy engine(s)
      Run time limit on kernels: No
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 6 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    Device 3: "GeForce GTX 1080"
      CUDA Driver Version / Runtime Version 8.0 / 8.0
      CUDA Capability Major/Minor version number: 6.1
      Total amount of global memory: 8113 MBytes (8507555840 bytes)
      (20) Multiprocessors, (128) CUDA Cores/MP: 2560 CUDA Cores
      GPU Max Clock rate: 1734 MHz (1.73 GHz)
      Memory Clock rate: 5005 Mhz
      Memory Bus Width: 256-bit
      L2 Cache Size: 2097152 bytes
      Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 49152 bytes
      Total number of registers available per block: 65536
      Warp size: 32
      Maximum number of threads per multiprocessor: 2048
      Maximum number of threads per block: 1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch: 2147483647 bytes
      Texture alignment: 512 bytes
      Concurrent copy and kernel execution: Yes with 2 copy engine(s)
      Run time limit on kernels: No
      Integrated GPU sharing Host Memory: No
      Support host page-locked memory mapping: Yes
      Alignment requirement for Surfaces: Yes
      Device has ECC support: Disabled
      Device supports Unified Addressing (UVA): Yes
      Device PCI Domain ID / Bus ID / location ID: 0 / 10 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from TITAN X (GPU0) -> GeForce GTX TITAN Black (GPU1) : No
    > Peer access from TITAN X (GPU0) -> GeForce GTX TITAN X (GPU2) : No
    > Peer access from TITAN X (GPU0) -> GeForce GTX 1080 (GPU3) : No
    > Peer access from GeForce GTX TITAN Black (GPU1) -> TITAN X (GPU0) : No
    > Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX TITAN X (GPU2) : No
    > Peer access from GeForce GTX TITAN Black (GPU1) -> GeForce GTX 1080 (GPU3) : No
    > Peer access from GeForce GTX TITAN X (GPU2) -> TITAN X (GPU0) : No
    > Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX TITAN Black (GPU1) : No
    > Peer access from GeForce GTX TITAN X (GPU2) -> GeForce GTX 1080 (GPU3) : No
    > Peer access from GeForce GTX 1080 (GPU3) -> TITAN X (GPU0) : No
    > Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN Black (GPU1) : No
    > Peer access from GeForce GTX 1080 (GPU3) -> GeForce GTX TITAN X (GPU2) : No
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 4, Device0 = TITAN X, Device1 = GeForce GTX TITAN Black, Device2 = GeForce GTX TITAN X, Device3 = GeForce GTX 1080
    Result = PASS
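A few of the deviceQuery fields above are enough for back-of-the-envelope peak numbers. The sketch below estimates peak single precision throughput as 2 FLOPs (one fused multiply-add) per CUDA core per clock, and peak memory bandwidth as 2 x memory clock x bus width in bytes, which is the usual double-data-rate convention for the reported memory clock. These are rough estimates under those assumptions, not measured values, and the function names are mine.

```python
# Fields copied from the deviceQuery output for each device.
devices = {
    "TITAN X (Pascal)": dict(cores=3584, max_clock_mhz=1531, mem_clock_mhz=5005, bus_bits=384),
    "GeForce GTX TITAN Black": dict(cores=2880, max_clock_mhz=980, mem_clock_mhz=3500, bus_bits=384),
    "GeForce GTX TITAN X": dict(cores=3072, max_clock_mhz=1076, mem_clock_mhz=3505, bus_bits=384),
    "GeForce GTX 1080": dict(cores=2560, max_clock_mhz=1734, mem_clock_mhz=5005, bus_bits=256),
}


def peak_fp32_gflops(cores, max_clock_mhz):
    """Estimated peak FP32 GFLOP/s: 2 FLOPs (one FMA) per core per clock."""
    return 2.0 * cores * max_clock_mhz / 1000.0


def peak_mem_gb_s(mem_clock_mhz, bus_bits):
    """Estimated peak memory bandwidth in GB/s.

    Assumes two transfers per reported memory clock over a bus of
    bus_bits / 8 bytes.
    """
    return 2.0 * mem_clock_mhz * (bus_bits / 8.0) / 1000.0


peaks = {}
for name, d in devices.items():
    peaks[name] = (
        peak_fp32_gflops(d["cores"], d["max_clock_mhz"]),
        peak_mem_gb_s(d["mem_clock_mhz"], d["bus_bits"]),
    )
    print(f"{name:24s} ~{peaks[name][0]:.0f} GFLOP/s FP32, ~{peaks[name][1]:.0f} GB/s memory")
```

Comparing these estimates with the nbody results gives a rough sense of how much of each card's peak the benchmark actually reaches.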
Happy computing –dbk