NVIDIA Quadro GP100: Tesla P100 power on your desktop

This is an update to the post I did on Feb 5th, 2017. I have access to the cards again and realized I had messed up some of the testing in the first post. These results are much better. This is still a very brief look at Quadro GP100 performance, using two programs built from the CUDA samples: nbody and p2pBandwidthLatencyTest. It's just a quick peek. Hopefully we'll be able to include these great cards in all of our systems testing going forward.

NVIDIA GP100 with NVLINK in Puget Systems Peak Mini

I got a brief chance to play with a couple of the new NVIDIA Quadro GP100s. The GP100 is effectively a Tesla P100 with NVLINK plus high-end Quadro display capability. Nice!

The big thing to note is that this is a full NVIDIA Tesla P100 Pascal GPU compute engine together with Quadro video capability. You can use this as a workstation display device!

The Quadro GP100 uses the same GPU as the Tesla P100 and in fact has an advantage over the PCIe version of the Tesla P100, since the Quadro card supports 2-way NVLINK. From my quick testing it looks to give the same performance you would expect from the NVLINK version of the Tesla. When NVIDIA provides us with full specs I'll post them.

Make no mistake, the Quadro GP100 is a "Quadro" card! It has all the high-end workstation graphics capability you would expect in a top-of-the-line Quadro card. AND, you get the compute performance of the highest-end Tesla. The Quadro GP100 can form the base of a "super-workstation". I feel it is the most exciting "video card" NVIDIA has produced to date!

I had two of the Quadro GP100s running on our "DIGITS" test system with Ubuntu 16.04, CUDA 8.0.61, and the nvidia-375 driver. I only had the cards for a short time, so I compiled a couple of test programs from the CUDA samples to see how they performed.
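As a quick sanity check before benchmarking, you can enumerate what CUDA sees with a few lines against the runtime API. This is just a minimal sketch of my own (file and program names are arbitrary), not one of the CUDA samples; on this system it should list both GP100s as compute capability 6.0 Pascal devices with 16GB of memory each.

// devquery.cu -- minimal device enumeration sketch (build with: nvcc devquery.cu -o devquery)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // A Quadro GP100 should report compute capability 6.0 and 16GB of device memory
        printf("Device %d: %s, compute %d.%d, %d SMs, %.1f GB\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}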

One of the standout features of the GP100 is its double precision floating point performance. It's nearly 10 times that of a GeForce Titan X Pascal card. [ Note: multi-GPU scaling with the GP100 and NVLINK also seems to be much better than with the Titan X. I'll need to do more testing on that … ]

nbody benchmark using -fp64 [ double precision ]

(1) GP100 : 2977 GFLOP/s

(2) GP100 : 5615 GFLOP/s

(1) Titan X:   307 GFLOP/s
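For context, those GFLOP/s figures come straight from nbody's interaction count: N bodies give N² pair interactions per iteration, and the sample counts 30 flops per interaction in double precision (20 in single). Using the numbers from the single-GPU double precision run in the raw output below (10 iterations of 256,000 bodies in 6.605 seconds):

$$ \text{GFLOP/s} \;=\; \frac{10 \times 256000^2 \times 30}{6.605\,\text{s} \times 10^9} \;\approx\; 2977 $$

which is where the roughly 99.2 billion interactions per second and ~2977 GFLOP/s in that run come from.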

Another standout performance feature is the peer-to-peer (card-to-card) bandwidth over NVLINK.

p2pBandwidthLatencyTest, 2 x GP100 [ Bidirectional P2P=Enabled Bandwidth ]

NVLINK ON :    146 GB/s

NVLINK OFF :  26 GB/s
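If you're curious what p2pBandwidthLatencyTest is exercising, here is a stripped-down sketch of a peer-to-peer copy with the CUDA runtime API. This is my own simplified version (file name and buffer size are arbitrary), not the sample code itself. With peer access enabled the copy goes directly card-to-card, over NVLINK when it's available or over the PCIe bus when it isn't (the ~146 GB/s vs ~26 GB/s difference above); with peer access disabled the copy is staged through host memory.

// p2pcopy.cu -- simplified peer-to-peer copy sketch (build with: nvcc p2pcopy.cu -o p2pcopy)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check whether each GPU can address the other's memory directly
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P access 0->1: %d, 1->0: %d\n", can01, can10);

    const size_t bytes = 256u << 20;   // 256MB test buffers
    float *buf0 = nullptr, *buf1 = nullptr;

    cudaSetDevice(0);
    cudaMalloc((void**)&buf0, bytes);
    if (can01) cudaDeviceEnablePeerAccess(1, 0);   // let device 0 reach device 1

    cudaSetDevice(1);
    cudaMalloc((void**)&buf1, bytes);
    if (can10) cudaDeviceEnablePeerAccess(0, 0);   // let device 1 reach device 0

    // Time a single device 0 -> device 1 copy with CUDA events
    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device 0 -> Device 1: %.1f GB/s\n", (bytes / 1.0e9) / (ms / 1.0e3));

    cudaFree(buf0);
    cudaSetDevice(1);
    cudaFree(buf1);
    return 0;
}

The actual sample averages many repetitions and runs copies in both directions at once for the bidirectional numbers, so treat a single timed copy like this as illustrative only.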

This is a very interesting "video card"!  I'm looking forward to doing more testing with these.  

Below is some of the raw output from the test runs so you can see the numbers for yourself.


 
./nbody -benchmark -numbodies=256000 -device=0
Single Precision
 
gpuDeviceInit() CUDA Device [0]: "Quadro GP100 
> Compute 6.0 CUDA device: [Quadro GP100] 
number of bodies = 256000 
256000 bodies, total time for 10 iterations: 2071.087 ms 
= 316.433 billion interactions per second 
= 6328.658 single-precision GFLOP/s at 20 flops per interaction 
 
 
./nbody -benchmark -numbodies=256000 -numdevices=2 
 
> Compute 6.0 CUDA device: [Quadro GP100]
> Compute 6.0 CUDA device: [Quadro GP100]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 1610.448 ms
= 406.943 billion interactions per second
= 8138.853 single-precision GFLOP/s at 20 flops per interaction

 
./nbody -benchmark -fp64 -numbodies=256000 -numdevices=1 (and again with -numdevices=2)
Double Precision
 
> Compute 6.0 CUDA device: [Quadro GP100] 
number of bodies = 256000 
256000 bodies, total time for 10 iterations: 6605.276 ms 
= 99.218 billion interactions per second 
= 2976.530 double-precision GFLOP/s at 30 flops per interaction 
 
> Compute 6.0 CUDA device: [Quadro GP100] 
> Compute 6.0 CUDA device: [Quadro GP100] 
number of bodies = 256000
256000 bodies, total time for 10 iterations: 3501.325 ms
= 187.175 billion interactions per second
= 5615.246 double-precision GFLOP/s at 30 flops per interaction

Bandwidth results: 
 
kinghorn@u1604docker:~/testing/samples/bin/x86_64/linux/release$ ./bandwidthTest --device=0 
[CUDA Bandwidth Test] - Starting... 
Running on... 
 
Device 0: Quadro GP100 
Quick Mode 
 
Host to Device Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
   Transfer Size (Bytes)    Bandwidth(MB/s) 
   33554432            11662.4 
 
Device to Host Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
   Transfer Size (Bytes)    Bandwidth(MB/s) 
   33554432            12865.6 
 
Device to Device Bandwidth, 1 Device(s) 
PINNED Memory Transfers 
   Transfer Size (Bytes)    Bandwidth(MB/s) 
   33554432            495701.3 
 
Result = PASS 
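A note on the "PINNED Memory Transfers" label above: bandwidthTest uses page-locked (pinned) host memory so the GPU's DMA engine can stream straight across PCIe without an extra host-side copy, which is how you get into the 11-13 GB/s range on a PCIe 3.0 x16 slot. Here's a rough sketch of that kind of transfer, my own simplified version with a single timed copy (the real test averages many, so expect noisier numbers from this):

// pinnedcopy.cu -- simplified pinned host-to-device transfer sketch (build with: nvcc pinnedcopy.cu -o pinnedcopy)
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;   // 32MB, the same transfer size bandwidthTest reports
    float *h_pinned = nullptr, *d_buf = nullptr;

    cudaMallocHost((void**)&h_pinned, bytes);   // page-locked (pinned) host allocation
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host -> Device (pinned): %.0f MB/s\n", (bytes / 1.0e6) / (ms / 1.0e3));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}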
 
NVLINK ON Peer-To-Peer 
 
kinghorn@udocker:~/projects/samples/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Quadro GP100, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro GP100, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 503.50  11.07 
     1  11.15 502.35 
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 505.34  72.88 
     1  72.83 502.33 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 515.50  18.92 
     1  19.03 512.78 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 513.58 145.57 
     1 146.28 512.64 
P2P=Disabled Latency Matrix (us)
   D\D     0      1 
     0   2.64  14.70 
     1  14.49   2.61 
P2P=Enabled Latency Matrix (us)
   D\D     0      1 
     0   2.61   6.85 
     1   6.92   2.60 


NVLINK OFF Peer-To-Peer 
 
kinghorn@u1604docker:~/testing/samples/bin/x86_64/linux/release$ ./p2pBandwidthLatencyTest  
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test] 
Device: 0, Quadro GP100, pciBusID: 6, pciDeviceID: 0, pciDomainID:0 
Device: 1, Quadro GP100, pciBusID: 5, pciDeviceID: 0, pciDomainID:0 
Device=0 CAN Access Peer Device=1 
Device=1 CAN Access Peer Device=0 
 
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure. 
So you can see lesser Bandwidth (GB/s) in those cases. 
 
P2P Connectivity Matrix 
     D\D     0     1 
     0         1     1 
     1         1     1 
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) 
   D\D     0      1  
     0 500.52   9.91  
     1   9.93 501.73  
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s) 
   D\D     0      1  
     0 503.61  12.67  
     1  12.25 501.56  
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s) 
   D\D     0      1  
     0 511.59  10.52  
     1  10.49 512.03  
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s) 
   D\D     0      1  
     0 511.05  26.28  
     1  26.25 509.48  
P2P=Disabled Latency Matrix (us) 
   D\D     0      1  
     0   2.70  17.47  
     1  17.36   2.62  
P2P=Enabled Latency Matrix (us) 
   D\D     0      1  
     0   2.60   9.15  
     1   9.40   2.62  

Happy computing! –dbk