NVIDIA Jetson TK1 CUDA performance

Dr Donald Kinghorn (HPC and Scientific Computing)


Posted on May 23, 2014 by Dr Donald Kinghorn



One of the many nifty devices on display at the 2014 GTC meeting was the Jetson TK1 developer board. It was used in an autonomously driven Audi that cruised onto stage by itself during Jen-Hsun Huang's keynote, and GE had a Jetson connected to a high-res camera and a LiDAR sensor doing real-time tracking and range finding. Pretty interesting applications for a little 5W Tegra ARM board with a Kepler GPU onboard. I pre-ordered a couple of boards as soon as I got back from the conference. They have arrived!
NVIDIA Jetson TK1 dev board on bench
It's an interesting dev board built around an ARM Cortex-A15 and a Kepler GPU with 192 CUDA cores … and it costs, ahhh, $192 :-)

My second board is in the nice little protective acrylic case that Matt Bach made... I'll try to talk him into making a few more :-)  Edit: He agreed! You can find them at this link: Jetson TK1 case.

NVIDIA Jetson TK1 acrylic case

Board features

  • Kepler GPU with 192 CUDA cores
  • ARM Cortex-A15 4+1 quad-core CPU
  • 2 GB system memory, 16 GB eMMC storage, SD/MMC slot, SATA
  • HDMI, 1 USB 2.0 (micro), 1 USB 3.0, RS232 serial port, Realtek audio, GigE LAN
  • JTAG port and a 125-pin expansion connector with lots of interesting protocol support
  • Full specs: https://developer.nvidia.com/jetson-tk1

The board comes with Linux for Tegra R19.2 installed. This is an Ubuntu 14.04 ARM build for Tegra, complete with the "Unity" desktop (which is the only desktop I like less than Windows 8! ... I'll fix that later).

Once you start up the board you can install the NVIDIA drivers and desktop environment using the installer and tar file in the default user account. Then grab the CUDA SDK repo package from the CUDA ZONE (you will need to be a registered developer to get this), install the repo with dpkg, and then install CUDA from the package manager. Copy the /usr/local/cuda/samples directory to your home directory and run "make". In a little while you will have a whole bunch of CUDA goodness to play with.
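The steps above look roughly like this on the board. This is a sketch, not an exact recipe: the .deb file name and the cuda-toolkit package name are placeholders that depend on the CUDA release you download from the CUDA ZONE.

```shell
# Install the CUDA repo package downloaded from the CUDA ZONE
# (file name below is a placeholder -- use the one you actually downloaded)
sudo dpkg -i cuda-repo-l4t_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit   # package name varies by CUDA release

# Copy the bundled samples to your home directory and build them
cp -r /usr/local/cuda/samples ~/CUDA/samples
cd ~/CUDA/samples
make
```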

As a quick benchmark I ran the nbody code from the CUDA samples directory on the Jetson and on a Peak Mini with a Titan and a Tesla K40. Here are the results:

nbody -benchmark -numbodies=65536

System               GFLOPS (single precision)
Jetson TK1           157.592
Tesla K40            1704.071
Titan                1894.420
Tesla + Titan *      5739.912
Jetson ARM CPU **    0.076
Intel i5-4570 **     0.820
158 GFLOPS is pretty impressive for a 5W developer board!

* The nbody sample CUDA code will run on multiple cards and it appears to have scaled superlinearly! (probably from the extra memory available from using two cards)

** The CPU run from the nbody code is not optimized and runs with a single thread ... so yeah, it's that bad! I had to cut the number of "bodies" down to 8192 to get it to finish in a reasonable amount of time.
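For anyone who wants to check the arithmetic: nbody counts N² pairwise interactions per iteration at 20 flops per interaction, divided by wall time. A quick sanity check of the Jetson figure, using the numbers from the run output (65536 bodies, 10 iterations, 5450.757 ms):

```shell
# Recompute the Jetson GFLOP/s figure reported by nbody
awk 'BEGIN {
    n = 65536; iters = 10; ms = 5450.757
    per_sec = n * n * iters / (ms / 1000)          # pair interactions per second
    printf "%.3f billion interactions/s\n", per_sec / 1e9
    printf "%.3f GFLOP/s\n", per_sec * 20 / 1e9    # 20 flops per interaction
}'
```

The result lands right on the ~157.6 GFLOP/s the sample reports.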

Here's some of the job run output for your number-loving enjoyment!

+++ Jetson from desktop+++
ubuntu@tegra-ubuntu:~/CUDA/samples/bin/armv7l/linux/release/gnueabihf$ ./nbody -benchmark -numbodies=65536

GPU Device 0: "GK20A" with compute capability 3.2

> Compute 3.2 CUDA device: [GK20A]
number of bodies = 65536
65536 bodies, total time for 10 iterations: 5450.757 ms
= 7.880 billion interactions per second
= 157.592 single-precision GFLOP/s at 20 flops per interaction

+++Tesla K40+++
[kinghorn@mini release]$ ./nbody -benchmark -numbodies=65536 

GPU Device 0: "Tesla K40c" with compute capability 3.5

> Compute 3.5 CUDA device: [Tesla K40c]
number of bodies = 65536
65536 bodies, total time for 10 iterations: 504.083 ms
= 85.204 billion interactions per second
= 1704.071 single-precision GFLOP/s at 20 flops per interaction

+++Titan+++
[kinghorn@mini release]$ ./nbody -benchmark -numbodies=65536 -device=1

gpuDeviceInit() CUDA Device [1]: "GeForce GTX TITAN"
> Compute 3.5 CUDA device: [GeForce GTX TITAN]
number of bodies = 65536
65536 bodies, total time for 10 iterations: 453.433 ms
= 94.721 billion interactions per second
= 1894.420 single-precision GFLOP/s at 20 flops per interaction

+++Tesla+Titan+++
[kinghorn@mini release]$ ./nbody -benchmark -numbodies=65536 -numdevices=2

number of CUDA devices  = 2
GPU Device 0: "Tesla K40c" with compute capability 3.5
> Compute 3.5 CUDA device: [Tesla K40c]
> Compute 3.5 CUDA device: [GeForce GTX TITAN]

number of bodies = 65536
65536 bodies, total time for 10 iterations: 149.653 ms
= 286.996 billion interactions per second
= 5739.912 single-precision GFLOP/s at 20 flops per interaction

+++Jetson ARM+++
ubuntu@tegra-ubuntu:~/CUDA/samples/bin/armv7l/linux/release/gnueabihf$ ./nbody -benchmark -numbodies=8192 -cpu

> Simulation with CPU
number of bodies = 8192
8192 bodies, total time for 10 iterations: 176752.297 ms
= 0.004 billion interactions per second
= 0.076 single-precision GFLOP/s at 20 flops per interaction

+++i5 4570+++
[kinghorn@mini release]$ ./nbody -benchmark -numbodies=8192 -cpu

> Simulation with CPU
number of bodies = 8192
8192 bodies, total time for 10 iterations: 16362.027 ms
= 0.041 billion interactions per second
= 0.820 single-precision GFLOP/s at 20 flops per interaction

Problems!

I had one strange problem. When I first did my testing I was logged into the system over ssh, and the results of the job runs were 10 times slower than when I later ran from a terminal opened from the desktop! The same thing happened if I ran the tests from one of the consoles (tty1). I don't know if this is an Ubuntu problem or just something strange with the setup for ARM. I did the Tesla and Titan runs on a machine over an ssh connection to CentOS 6.5 and they ran as expected. If I figure out what's going on I'll let you know in the comments.

Happy computing --dbk


Tags: Nvidia, Jetson, CUDA

