If you have done a fresh install of CentOS 6.6, or “updated” to it from a 6.5 install, and you are setting up NVIDIA CUDA 6.5, you may be having trouble with a failed build of the nvidia-uvm kernel module. Read on for a fix …
I was setting up a machine for GPU compute using NVIDIA CUDA a few days ago and hit a snag. I did a CentOS 6.5 install, ran updates, and noticed that it had updated to release 6.6 (which had just gotten to the repo mirrors). I checked the NVIDIA CUDA download pages and saw there wasn’t any specific CUDA release for RHEL / CentOS 6.6, so I decided to go ahead and try the setup using the cuda repo package anyway.
The CUDA yum repo makes it really easy to set up CUDA and get the NVIDIA drivers working. Just grab the repo rpm, install that with yum, and then do
yum install cuda
Everything looked fine during the install, so I compiled the code from the samples directory and fired up the nbody job as a test. It failed with the following,
FATAL: Module nvidia_uvm not found.
Error: only 0 Devices available, 1 requested. Exiting.
Trying to load the module gives,
[root@tower cuda]# modprobe nvidia-uvm
FATAL: Module nvidia_uvm not found.
Then started the long process of trying to track down a log file to see what happened … I finally found this,
/var/lib/dkms/nvidia-uvm/340.29/build/make.log
[root@tower build]# cat make.log
DKMS make.log for nvidia-uvm-340.29 for kernel 2.6.32-504.el6.x86_64 (x86_64)
Mon Nov 3 16:50:28 PST 2014
Makefile:213: /var/lib/dkms/nvidia/340.29/build/nvidia-modules-common.mk: No such file or directory
make: *** No rule to make target `/var/lib/dkms/nvidia/340.29/build/nvidia-modules-common.mk'. Stop.
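If you would rather not hunt for the log by hand, something like the following might save some time. This is only a rough sketch: the function name scan_dkms_logs is made up for illustration, and the default of /var/lib/dkms is an assumption based on the path above. It lists every make.log under the tree and prints the lines that look like make errors.

```shell
#!/bin/sh
# Hypothetical helper (not part of DKMS): find every make.log under a DKMS
# tree and print the lines that look like make failures.
# The default root is an assumption taken from the path shown in this post.
scan_dkms_logs() {
    root="${1:-/var/lib/dkms}"
    find "$root" -type f -name make.log | while IFS= read -r log; do
        echo "== $log"
        # Show "No such file or directory" messages and make's "***" error lines.
        # "|| true" keeps going when a log has no matching lines.
        grep -n -e 'No such file or directory' -e '\*\*\*' "$log" || true
    done
}
```

Run it as scan_dkms_logs with no argument to use the default location, or pass a different directory as the first argument.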
OK, so it looks like there is a problem in the “nvidia” directory, not the “nvidia-uvm” directory.
Looking at the directories in the “nvidia-uvm” directory I see,
[kinghorn@tower 340.29]$ ls /var/lib/dkms/nvidia-uvm/340.29/
build source
and in the “nvidia” directory we have,
[kinghorn@tower 340.29]$ ls /var/lib/dkms/nvidia/340.29/
2.6.32-504.el6.x86_64 source
THERE IS NO “build” DIRECTORY! However the “source” directory does contain the nvidia-modules-common.mk file, so …
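To confirm you are seeing the same situation before changing anything, a small check along these lines could help. It is a sketch under assumptions: check_dkms_dir is a hypothetical name, and the default path is the 340.29 driver directory from this post, so pass your own directory if your driver version differs.

```shell
#!/bin/sh
# Hypothetical helper: report whether a DKMS module version directory has a
# "build" entry and whether "source" contains nvidia-modules-common.mk.
# The default path is the one from this post; adjust for your driver version.
check_dkms_dir() {
    dir="${1:-/var/lib/dkms/nvidia/340.29}"
    if [ -e "$dir/build" ]; then
        echo "build: present"
    else
        echo "build: MISSING"
    fi
    # -f follows symlinks, so this works when "source" is itself a link.
    if [ -f "$dir/source/nvidia-modules-common.mk" ]; then
        echo "nvidia-modules-common.mk: found in source"
    else
        echo "nvidia-modules-common.mk: not found in source"
    fi
}
```

If it reports that build is missing while the .mk file is found in source, you are looking at the same problem described here.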
HERE’S THE FIX:
Create a symbolic link named build that points to source,
[root@tower 340.29]# ln -s source build
[root@tower 340.29]# ls -l
total 4
drwxr-xr-x 3 root root 4096 Nov 3 16:46 2.6.32-504.el6.x86_64
lrwxrwxrwx 1 root root 6 Nov 3 17:31 build -> source
lrwxrwxrwx 1 root root 22 Nov 3 16:46 source -> /usr/src/nvidia-340.29
**** reboot ****
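If you need to repeat this on more than one machine, the same step could be wrapped in a slightly more cautious form. Treat this as a sketch rather than a tested tool: link_build_to_source is a hypothetical name, the default path assumes the 340.29 driver from this post, and it only creates the link when build is absent and the .mk file really is in source.

```shell
#!/bin/sh
# Hypothetical helper: create the "build -> source" symlink described above,
# but only when it is safe to do so. Run as root on a real DKMS tree.
# The default path is an assumption taken from this post.
link_build_to_source() {
    dir="${1:-/var/lib/dkms/nvidia/340.29}"
    # Do nothing if a build entry (directory, file, or link) already exists.
    if [ -e "$dir/build" ] || [ -L "$dir/build" ]; then
        echo "build already exists in $dir, leaving it alone"
        return 0
    fi
    # Refuse to link if the file the nvidia-uvm Makefile wants is not there.
    if [ ! -f "$dir/source/nvidia-modules-common.mk" ]; then
        echo "source/nvidia-modules-common.mk not found in $dir, not linking" >&2
        return 1
    fi
    # Relative link, equivalent to running "ln -s source build" inside $dir.
    ln -s source "$dir/build"
    echo "created $dir/build -> source"
}
```

Calling link_build_to_source with no argument uses the default directory. The reboot afterwards is still up to you.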
Now dkms triggers correctly, the modules build and we have CUDA joy!
Happy computing! --dbk