https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1170
Dr Donald Kinghorn (Scientific Computing Advisor)

Install TensorFlow with GPU Support the Easy Way on Ubuntu 18.04 (without installing CUDA)

Written on May 25, 2018 by Dr Donald Kinghorn


TensorFlow is a very important Machine/Deep Learning framework and Ubuntu Linux is a great workstation platform for this type of work. If you want to set up a workstation using Ubuntu 18.04 with CUDA GPU acceleration support for TensorFlow, then this guide should help you get your machine learning environment up and running without a lot of trouble. And, you don't have to do a CUDA install!

This guide is for Ubuntu 18.04 but I will also be doing a similar post using the latest Windows 10 build.

Ubuntu 18.04 is out and in my opinion it is a big improvement over 16.04. 18.04 is the latest LTS (Long Term Support) build of Ubuntu. It will become the standard base platform for a lot of projects. There is usually some lag time before packages and projects move to a new base platform like Ubuntu 18.04. However, at this point nearly all of the projects that I care about are already supported on 18.04.

I said "nearly all" ...! Right now I have Ubuntu 18.04 running supported versions of Docker, NVIDIA-docker v2, Virtualbox, Anaconda Python, etc. There is only one package that I generally install that is not (officially) supported on 18.04 yet. That one package is NVIDIA CUDA. I had waited to write anything about Ubuntu 18.04 until CUDA 9.2 was released because I was sure it would have install support for 18.04. Well, guess what, it doesn't. For Ubuntu the recent 9.2 CUDA release only has installer support for 16.04 and 17.10! I was really surprised to see that. 16.04 makes sense, but 17.10 is a short-term intermediate release and it is similar enough to 18.04 that I don't understand why 18.04 didn't happen. There may be something broken that they just decided to wait to fix rather than delay the 9.2 release any further. That would be understandable and reasonable.

I will do a detailed post on how to do an Ubuntu 18.04 install including an unofficial CUDA 9.2 install. In this post I am assuming you have successfully installed Ubuntu 18.04. If that is not the case then you may want to wait for my detailed install post.

If you are not doing CUDA development work then you may not need to install CUDA anyway. The focus here is to get a good GPU accelerated TensorFlow work environment up and running without a lot of fuss.

I'm adding a note here about some issues that have come up in the comments. If you see the following error when you try to run a TF or Keras job, it's because your NVIDIA display driver is not new enough for the new TF and Keras builds on the Anaconda cloud, i.e. they are linking against CUDA libs that need the nvidia-396 runtime and you have the nvidia-390 runtime installed.
Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

You should update your NVIDIA driver! Use either the "Software & Updates" GUI tool under the "Additional Drivers" tab or update the driver manually from the command line like,
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-430

Check the latest version here, https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

Best wishes --Don
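
If you want to check for this mismatch before you hit the error, the comparison can be sketched in a few lines of Python. This is only a rough sketch: the minimum driver versions in the table are my reading of NVIDIA's CUDA release notes (an assumption you should verify for your release), and the function names are made up for this illustration.

```python
import subprocess

# Minimum Linux driver version per CUDA toolkit release.
# Assumption: values taken from NVIDIA's CUDA release notes; verify them.
MIN_DRIVER = {
    "9.0": (384, 81),
    "9.2": (396, 26),
    "10.0": (410, 48),
}


def parse_version(text):
    """Turn '396.54' into (396, 54) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, cuda_version):
    """True if the driver is new enough for the given CUDA runtime."""
    return parse_version(driver_version) >= MIN_DRIVER[cuda_version]


def installed_driver_version():
    """Ask nvidia-smi for the driver version (needs a working driver)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        universal_newlines=True,
    )
    return out.splitlines()[0].strip()
```

On a machine with a working driver, `driver_supports(installed_driver_version(), "9.2")` tells you whether a TensorFlow build linked against CUDA 9.2 should be able to start.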


Python environment setup with Anaconda Python

I highly recommend you use Anaconda Python. If you need some arguments for using Python take a look at my post Should You Learn to Program with Python. For arguments on why you should use the Anaconda Python distribution see, How to Install Anaconda Python and First Steps for Linux and Windows.

Anaconda is focused toward data-science and machine learning. It installs cleanly on your system in a single directory so it doesn't make a mess in your systems application and library directories. It is also performance optimized and links important numerical packages like numpy to Intel's MKL. Most importantly for this post, it includes easily installed modules for TensorFlow that include the CUDA dependencies!

Install Anaconda Python

sha256sum Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
  • You will be asked to accept a license agreement and then questioned about the install location. By default it will install at the top of your home directory under anaconda3. I recommend that you use that. [ If you ever want to get rid of it or reinstall you can just remove that directory.]
  • Next it will ask if you want to append the Anaconda executable directory to your PATH environment variable in .bashrc. I recommend that you do that, but remember that you did. It will add something like the following at the end of your .bashrc file,
# added by Anaconda3 installer
export PATH="/home/dbk/anaconda3/bin:$PATH"
  • Then "re-source" your .bashrc file to execute that export. [ It will happen automatically on subsequent login. ]
source ~/.bashrc
  • Next you will be asked if you want to install Microsoft VSCode. VSCode is a really good editor and it is available for free on Windows, Linux and MacOS. However, if you are interested in trying it out I would recommend that you go to the VSCode website and check it out rather than installing it here. If you think you want to try it then go ahead and download it and install it yourself. I usually use the Atom editor which also runs on Windows, Linux and MacOS. If you are checking out editors I recommend you try both of these as well as Sublime Text. They are all great editors!
  • Check your install. If you have sourced your .bashrc file and your PATH is correct you should see something like,
python --version

Python 3.6.4 :: Anaconda, Inc.
  • Update your base Anaconda packages. (conda is a powerful package and environment management tool for Anaconda and it's not restricted to use with just Python)
conda update conda
conda update anaconda
conda update python
conda update --all

That should bring your entire base Anaconda install up to the latest packages.

There is a GUI for Anaconda called anaconda-navigator. I personally find it distracting/confusing/annoying and prefer using conda from the command-line. Your taste may differ! ... and my opinion is subject to change if they keep improving it.


Create a Python "virtual environment" for TensorFlow using conda

You should set up an environment for TensorFlow separate from your base Anaconda environment. This keeps your base clean and will give TensorFlow a space for all of its dependencies. It is in general good practice to keep separate environments for projects, especially when they have special package dependencies.

There are many possible options when creating an environment with conda including adding packages with specific version numbers and specific Python base versions. This is sometimes useful if you want fine control and it also helps with version dependencies resolution. Here we will keep it simple and just create a named environment and then activate that environment and install the packages we want inside of that.

From a command line do,

conda create --name tf-gpu

I named the environment 'tf-gpu' but you can use any name you want.

Now activate the environment, (I'll show my full terminal prompt and output instead of just the commands)

dbk@i9:~$ source activate tf-gpu
(tf-gpu) dbk@i9:~$

You can see that my shell prompt is now preceded by the name of the environment.

Note: the newer 'conda' uses a different syntax now, "conda activate tf-gpu" and "conda deactivate" (deactivate takes no environment name).
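
If you are ever unsure which environment a Python process is really running in (this comes up with notebooks), you can ask the interpreter itself. The following is a small sketch, not part of conda: the function names are hypothetical, and it assumes the usual layout where an environment lives in a directory named after it and that conda sets the CONDA_DEFAULT_ENV variable on activation.

```python
import os
import sys


def env_name_from_prefix(prefix=None):
    """Last path component of the interpreter prefix, e.g. 'tf-gpu'."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.basename(os.path.normpath(prefix))


def activated_env(environ=None):
    """Name of the conda environment activated in the shell, or None."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


# Report both views for the interpreter running this code.
print("interpreter prefix:", sys.prefix)
print("environment by prefix:", env_name_from_prefix())
print("environment by shell:", activated_env())
```

The two answers can differ inside a notebook, because the kernel's interpreter may come from a different environment than the shell that launched Jupyter.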

Install TensorFlow from the Anaconda Cloud Repositories

The TensorFlow documentation is in general very good but the install documentation does not present a very good way to get a setup working on a workstation.

Do not follow the install documentation from the TensorFlow site! If you do you will have a painful time getting things working and you will have a nearly impossible to maintain install setup.

There is no good reason to do an (old) CUDA install and a pip install when you are using Anaconda Python. There is an up-to-date official Anaconda package for TensorFlow with GPU acceleration that includes all of the needed CUDA dependencies and it is well optimized for performance.

Let's install TensorFlow with GPU acceleration and all of the dependencies.

(tf-gpu) dbk@i9:~$ conda install tensorflow-gpu

That's it! That's all you need to do!

Just running that one short command above gave the following list of packages to be installed. They are installed and isolated in the "tf-gpu" environment we created. There is no nasty mess on your system!

I've cut some of the packages out and just left the "most interesting" ones in this output listing.

The following NEW packages will be INSTALLED:
...
...
    cudatoolkit:       9.0-h13b8566_0         
    cudnn:             7.1.2-cuda9.0_0        
    cupti:             9.0.176-0              
...      
    intel-openmp:      2018.0.0-8             
    mkl:               2018.0.2-1             
    mkl_fft:           1.0.1-py36h3010b51_0   
    mkl_random:        1.0.1-py36h629b387_0   

    libgcc-ng:         7.2.0-hdf63c60_3       
    libgfortran-ng:    7.2.0-hdf63c60_3       
    libprotobuf:       3.5.2-h6f1eeef_0       
    libstdcxx-ng:      7.2.0-hdf63c60_3       
...    
    numpy:             1.14.3-py36hcd700cb_1  
    numpy-base:        1.14.3-py36h9be14a7_1  
...
    protobuf:          3.5.2-py36hf484d3e_0   
    python:            3.6.5-hc3d631a_2       
...     
    tensorboard:       1.8.0-py36hf484d3e_0   
    tensorflow:        1.8.0-hb11d968_0       
    tensorflow-base:   1.8.0-py36hc1a7637_0   
    tensorflow-gpu:    1.8.0-h7b35bdc_0       

You now have GPU accelerated TensorFlow 1.8, CUDA 9.0, cuDNN 7.1, Intel's MKL libraries (that are linked into numpy) and TensorBoard. Nice!

Note: newer 'tensorflow-gpu' packages will install later versions than the ones listed here.
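
Before going further it is worth confirming that TensorFlow actually sees your GPU. Here is a minimal sketch, assuming the TF 1.x `device_lib` module from `tensorflow.python.client`; the helper names are my own. The TensorFlow import is kept inside a function so the filtering logic can be tried on stand-in records first.

```python
from types import SimpleNamespace  # only used for the offline demo below


def gpu_device_names(devices):
    """Keep the names of devices whose type is GPU."""
    return [d.name for d in devices if d.device_type == "GPU"]


def tensorflow_gpus():
    """List the GPUs TensorFlow can see (requires TensorFlow installed)."""
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())


# Offline demo with stand-in device records (hypothetical values).
fake = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:GPU:0", device_type="GPU"),
]
print(gpu_device_names(fake))  # ['/device:GPU:0']
```

With the tf-gpu environment activated, calling `tensorflow_gpus()` should return at least one entry. An empty list suggests a CPU-only package or a driver problem.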


Create a Jupyter Notebook Kernel for the TensorFlow Environment

You can work with an editor and the command line, and you will often want to do that, but Jupyter notebooks are great for doing machine learning development work. In order to get Jupyter notebook to work the way you want with this new TensorFlow environment you will need to add a "kernel" for it.

With your tf-gpu environment activated do,

(tf-gpu) dbk@i9:~$ conda install ipykernel jupyter

Now create the Jupyter kernel,

(tf-gpu) dbk@i9:~$ python -m ipykernel install --user --name tf-gpu --display-name "TensorFlow-GPU"

With this "tf-gpu" kernel installed, when you open a Jupyter notebook you will now have an option to start a new notebook with this kernel.
Jupyter kernel for TF
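
To make it less mysterious, a Jupyter "kernel" is essentially a small kernel.json file telling Jupyter which Python executable to launch. The sketch below builds the core fields of such a file. It is an illustration of the format, not the ipykernel source, and the function name is hypothetical; with `--user` on Linux the real file normally ends up under ~/.local/share/jupyter/kernels/tf-gpu/.

```python
import json
import sys


def kernel_spec(python_executable, display_name):
    """Build the core fields of a kernel.json for an IPython kernel."""
    return {
        # Jupyter fills in {connection_file} when it starts the kernel.
        "argv": [python_executable, "-m", "ipykernel_launcher",
                 "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


spec = kernel_spec(sys.executable, "TensorFlow-GPU")
print(json.dumps(spec, indent=1))
```

The important part is the first entry of `argv`: it should point at the python inside your tf-gpu environment. You can see which kernels are registered with `jupyter kernelspec list`.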


An Example using Keras with TensorFlow Backend

Note: I have a newer post that might be better to follow than this example: How to Install TensorFlow with GPU Support on Windows 10 (Without Installing CUDA) UPDATED! Yes, even though that is a Win10 install, everything after getting Anaconda Python working is pretty much the same on Windows and Linux!

In order to check everything out let's set up LeNet-5 using Keras (with our TensorFlow backend) using a Jupyter notebook with our "TensorFlow-GPU" kernel. We'll train the model on the MNIST digits data-set.

Install Keras

With the tf-gpu environment activated do,

(tf-gpu) dbk@i9:~$ conda install keras-gpu

You now have Keras installed utilizing your GPU accelerated TensorFlow. It is that easy!

Note: the newer 'tensorflow-gpu' includes Keras so you don't need to do a separate install.

Launch a Jupyter Notebook

With the tf-gpu environment activated start Jupyter,

(tf-gpu) dbk@i9:~$ jupyter notebook

From the 'New' drop-down menu select the 'TensorFlow-GPU' kernel that you added (as seen in the image in the last section). You can now start writing code!

MNIST example

Following are Python snippets you can copy into cells in your Jupyter notebook to set up and train LeNet-5 with MNIST digits data.

Import dependencies

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten,  MaxPooling2D, Conv2D
from keras.callbacks import TensorBoard

Load and process the MNIST data

(X_train,y_train), (X_test, y_test) = mnist.load_data()

X_train = X_train.reshape(60000,28,28,1).astype('float32')
X_test = X_test.reshape(10000,28,28,1).astype('float32')

X_train /= 255
X_test /= 255

n_classes = 10
y_train = keras.utils.to_categorical(y_train, n_classes)
y_test = keras.utils.to_categorical(y_test, n_classes)

Create the LeNet-5 neural network architecture

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(28,28,1)) )
model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())          
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))

Compile the model

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Set log data to feed to TensorBoard for visual analysis

tensor_board = TensorBoard('./logs/LeNet-MNIST-1')

Train the model

model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
          validation_data=(X_test,y_test), callbacks=[tensor_board])

The results

After running that training for 15 epochs the last epoch gave,

Epoch 15/15
60000/60000 [==============================] - 5s 83us/step - loss: 0.0188 - acc: 0.9939 - val_loss: 0.0303 - val_acc: 0.9917

Not bad! Training accuracy 99.39% and Validation accuracy 99.17%


Look at the job run with TensorBoard

Start TensorBoard

(tf-gpu) dbk@i9:~$ tensorboard --logdir=./logs --port 6006

It will give you an address similar to http://i9:6006 Open that in your browser and you will be greeted with (the wonderful) TensorBoard. These are the plots it had for that job run,
TensorBoard output

That was a model with 1.2 million trainable parameters and a dataset with 60,000 images. It took 1 minute and 9 seconds utilizing the NVIDIA GeForce 1080Ti in my system!
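
If you are curious where the 1.2 million comes from, you can work it out by hand from the layer definitions in the model above. The helper functions below are just arithmetic written for this illustration (they are not Keras calls), using the usual weights-plus-biases count for each layer.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in x out) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


conv1 = conv2d_params(3, 1, 32)    # 28x28x1  -> 26x26x32
conv2 = conv2d_params(3, 32, 64)   # 26x26x32 -> 24x24x64
flat = 12 * 12 * 64                # after 2x2 max pooling, then Flatten
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 10)

total = conv1 + conv2 + dense1 + dense2
print(total)  # 1199882
```

Almost all of the parameters sit in the first Dense layer. Pooling, Dropout and Flatten add none. Running `model.summary()` in the notebook should report the same total.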

Happy computing! --dbk

Tags: Ubuntu 18.04, TensorFlow, CUDA, Intel MKL, Anaconda Python
Jay Chen

Great write up!

Posted on 2018-05-29 19:27:58
Chester Parrott

# Might be obvious, but also need to install Nvidia drivers first

sudo apt-get purge nvidia*
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
sudo apt-get install nvidia-390
sudo reboot now

Posted on 2018-06-01 23:12:51
koundy

Thanks!

Posted on 2018-06-12 11:40:44
João Devezas

I wouldn't recommend blindly installing that driver, instead it may be easier/wiser to let Ubuntu auto-install the drivers:

ubuntu-drivers devices
ubuntu-drivers autoinstall

reboot

Posted on 2018-12-09 13:59:34
Donald Kinghorn

(never blindly install anything! ... but, the graphics-drivers ppa is maintained by really good people ... I wouldn't recommend "autoinstall")

You need to know what drivers are being installed! If you want to run any code built against CUDA 10 you will need a 410+ driver to support the runtime.

... but I've noticed that the built in device detection in Ubuntu is better now and they support reasonable drivers. I think they still pull from graphics-drivers ppa though ??

I've been able to update and upgrade display drivers using the GUI tool "Software & Updates" under the "Additional Drivers" tab. So this would probably be the preferred way to do this now. You can still pick which driver you want.

Posted on 2018-12-10 17:35:17
Bhuvanesh S K

Sir there's a new problem... I guess..
After updating cosmic cuttlefish Ubuntu 18.10
nvidia-smi command says...
"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
Please help and give a solution to this problem...

Posted on 2018-12-16 09:56:09
Donald Kinghorn

For some reason I'm not getting your full comment ... however, my first advice would be to stay with 18.04 (it's going to be an extended LTS release so Canonical is really standing behind it) ... but I will probably try 18.10 too :-) First thing is to get the driver working; if nvidia-smi doesn't work then nothing else will. After that it might take some I would expect things to work with Anaconda ??? ... maybe I'll try it. I've been curious about changes in 18.10 ... just thought of something ... be sure you are using X and not Wayland for display ...

Posted on 2018-12-17 17:28:22
Bhuvanesh S K

Yeah that problem got fixed ... I just fixed broken files in the system .... But facing a new problem here ... Segmentation fault (core dumped) in Tensorflow... The conda environment is not recognising the GPU ... GPU memory is not accessible
Tensorflow GPU version 1.12 and cudnn 7.2.1 and cuda 9.2 ... All were installed through conda environment...

Posted on 2018-12-17 17:43:23
John

Thank you so much! It is really a PAIN!!! to follow the installation guide on tensorflow.org. I wish I could find this article earlier. GREAT JOB!

Posted on 2018-06-10 07:34:32
Ji Han Heo

Thank you so much! I've been struggling with installing GPU supported tensorflow with CUDA 9.2, cuDNN 7.1.4. on 16.04 LTS. Hours and days have I poured in trying so to no avail... apparently CUDA 9.2 isn't compatible with... I don't know what. I installed bionic beaver and followed your instructions here and voila it works!!! Thank you so much for saving my time!!!!!!

Posted on 2018-06-16 23:28:39
Soumitra

I followed the steps to create the environment. When running the example, I keep on getting the following error:

TypeError: softmax() got an unexpected keyword argument 'axis'.

Also, I would like to know where is cuda installed, since I do not see anything the following folder:

/usr/local/cuda

Any help / pointer to appropriate information is much appreciated.

Posted on 2018-07-22 20:20:35
Donald Kinghorn

I'm not sure about the error about softmax. There might be a missing ) somewhere ?? or maybe an indentation problem ?? not sure...

The CUDA libs are packaged with the tf module. In my install they are in ~/anaconda3/envs/tf-gpu/lib
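
If you want to see for yourself which CUDA libraries conda put in the environment, something like the following sketch will list them. The function name and the file name patterns are my own choices for illustration, and the default path in the usage note assumes the install location used in this post.

```python
from pathlib import Path


def find_cuda_libs(lib_dir):
    """Return sorted names of CUDA-related shared libraries in lib_dir."""
    lib_dir = Path(lib_dir).expanduser()
    patterns = ("libcudart*", "libcudnn*", "libcublas*")
    found = set()
    for pattern in patterns:
        # glob yields nothing if the directory does not exist
        found.update(p.name for p in lib_dir.glob(pattern))
    return sorted(found)
```

For the environment in this post you would call `find_cuda_libs("~/anaconda3/envs/tf-gpu/lib")`. An empty result usually means the path is wrong or the CPU-only tensorflow package was installed.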

Posted on 2018-08-01 21:33:58
Binh Truong

Thank you so much for creating this write up. I was following other instruction on how to build Tensorflow from source and got completely lost.

I did have to install twice because I missed Chester's steps. The first time, it ran correctly, but 10 times slower. After installing the Nvidia drivers first, uninstalling Anaconda, and reinstalling everything here, I got a better runtime at ~75s with a 1070Ti.

Posted on 2018-08-15 02:30:27
slow_learner

Great tutorial. I didn't know setting up tensorflow was this easy!

Posted on 2018-08-16 09:32:16
Carch

Thanks, man. Is there a way to launch a separate terminal that can access the already activated virtualenv so that I could use Tensorboard --logdir=./logs --port 6006? Please instruct, thanks.

Posted on 2018-08-21 10:47:10
Donald Kinghorn

... I've been on vacation for a bit ... You should be able to just fire up another terminal and "source activate tf-gpu" to give you another shell in the env I usually do that ... I like to have at least 2 shells open for an environment. ctrl-shift-T is your friend :-)

Posted on 2018-08-27 14:39:05
Carch

I thought there is another way. Thanks, man. You are a life saver. I tried to install Tensorflow for at least two weeks a couple months ago but failed. I decided to give it another shot a week ago, and I found your post. You have my deepest gratitude!!!

Posted on 2018-08-27 22:14:50
ramrad

Thanks for the excellent blog! I was able to install tf-gpu and keras. However when I execute the MNIST example in Jupyter, it does not use the GPUs on the system. It takes over 60 sec per epoch and nvidia-smi shows that the job is not executed on the GPUS which are idle. This is with TF 1.10 and keras 2.2

Posted on 2018-08-27 03:39:39
Donald Kinghorn

I'm not sure what the problem is but be sure to check that you have activated the tf-gpu environment. 60 sec seems like a long time for an epoch. It kind of sounds like you may have installed tensorflow into your environment instead of tensorflow-gpu. Also, with the Keras update it may be best to use keras-gpu ...

I would make a new environment and install tensorflow-gpu and keras-gpu in it and try it again. Hope this works for you!

Posted on 2018-08-27 21:08:36
richard

I had the same problem (xUbuntu 18.04, GTX1050, nvidia driver 396.54)

You should have a look here: https://github.com/Continuu...
Seems that it comes from an incompatibility between "conda-forge" and "defaults" packages

What worked for me:
conda create -n env_tf-gpu -c defaults tensorflow-gpu keras-gpu ipykernel

Hope it will help

Posted on 2018-09-17 09:00:52
Lucy Nowacki

First update your drivers, next install CUDA 9.2 from nvidia, and after use this tuto. It flies. It seems that CUDA of conda is already obsolete. My spec is laptop MSIGS70 with GTX 970M on drivers 410.73

Posted on 2018-11-19 00:16:24
Renz Abergos

Running on Ubuntu 18.04 with 1070ti.

Installed nvidia driver then followed these instructions but it did not work. Got error running model.fit()

E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Posted on 2018-08-28 09:41:31
Donald Kinghorn

I know that error too well ... This is from a runtime library mismatch. I talk about it a bit in this post https://www.pugetsystems.co... This is an annoyance with CUDA and the NVIDIA driver. It can go both ways, either a too new driver or too old, depending on which CUDA version you are using. It's because the CUDA runtime is provided by the driver, not the cuda toolkit, and these need to be in sync when you are doing dev work.

Let me check the newest anaconda tensorflow-gpu ... OK, I see it is using cudatoolkit 9.2. That means that you will need an nvidia-396 driver installed on your system. You should be able to update your driver either from the "Software & Updates" tool in Ubuntu or from the graphics-drivers ppa. You shouldn't have any trouble like what I described in the post that I linked since I think those quirks are fixed now. Once you get your driver updated (or downgraded!) you should be good to go.

Posted on 2018-08-28 15:48:13
Mauro Risonho Paula Assumpção

Great write up!

My system Ubuntu 18.04.1 x64
Nvidia Driver 390.77 / GeForce GT 740M / CUDA Cores:384 / GPU: 2048 MB / PCIe Generation:PCI Express x8 Gen2

model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
validation_data=(X_test,y_test), callbacks=[tensor_board])

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-accd94721056> in <module>()
----> 1 model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
2 validation_data=(X_test,y_test), callbacks=[tensor_board])

NameError: name 'model' is not defined

Posted on 2018-08-29 04:18:12
Adhitya Mohan

Have you tried using the latest nvidia driver 396 instead of 390?
the commands below should help you
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-396
and reboot and try to see if it works.

Posted on 2018-09-07 08:42:27
Donald Kinghorn

There's your comment, I didn't see it at first (except in my email notice)... I quoted it below for someone else. Thanks for posting it! I have added a paragraph near the top of the post about the runtime mismatch since the TF and Keras builds are linking to CUDA 9.2 now and that requires the runtime in the nvidia-396 driver. Best wishes --Don

Posted on 2018-09-07 17:38:12
Adhitya Mohan

Yeah I replied to the wrong comment since this is a user error, the previous one was about the runtime error.

Posted on 2018-09-07 17:43:46
Mauro Risonho Paula Assumpção

Wed Aug 29 17:14:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 On | N/A |
| N/A 43C P8 8W / N/A | 126MiB / 6069MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|==================================================|
| 0 3646 G /usr/lib/xorg/Xorg 120MiB |
| 0 4324 G /usr/lib/firefox/firefox 3MiB |
+-----------------------------------------------------------------------------+

[I 16:49:54.120 NotebookApp] Saving file at /Projects/Ubuntu18041.ipynb
2018-08-29 16:50:07.374783: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-29 16:50:08.935207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-29 16:50:08.936001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.733
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.69GiB
2018-08-29 16:50:08.936036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-29 16:50:08.936238: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Posted on 2018-08-29 20:15:20
Donald Kinghorn

This is the runtime library mismatch error. You need the nvidia-396 driver for the newer TF and Keras packages.

Adhitya Mohan has a good reply to fix this but I am not seeing his comment for some reason .... I'll just copy it here. Thank you Adhitya :-)

Have you tried using the latest nvidia driver 396 instead of 390?
the commands below should help you
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-396
and reboot and try to see if it works.

Posted on 2018-09-07 16:10:33
Kris

Thanks for this writeup! Regarding the driver mismatch, is there a way to request an older build of TF and Keras from the Anaconda cloud which are linked against the 390.x driver? I have a GeForce 550 Ti, which is not supported on the 396.x short lived branch. Otherwise, it is looking like I won't be able to use Anaconda?

Posted on 2018-09-08 17:10:37
Donald Kinghorn

Yes, you don't necessarily need to be using the latest version of TF. I believe it's been stable with a solid API since around 1.4 or earlier. You can use conda ...

conda search -f tensorflow-gpu

to list available package versions and then you can install a specific version with something like,

conda install tensorflow-gpu=1.8.0

That should get you the version I used originally in this post and it should work with the 390 driver.

Keep an eye out for good deals on used GPU's! The new GeForce 20xx series should be available soon and there could be a lot of people getting rid of what are really excellent cards. Anything from a GTX970 and up is really nice to work with. Best wishes --Don

Posted on 2018-09-09 17:04:40
Kris

Thank you for the reply. I installed 1.8.0, and it informed me that the GTX 550 Ti has a CUDA compute capability of 2.1, and that 3.5 or better was required. It was a budget card when I bought it long ago, so I shouldn't be surprised that it is no longer supported. Guess it is time to upgrade.
Thanks again!

Posted on 2018-09-12 02:46:55
Dima Kalika

Thank you!!!

Just a heads up - I have a gtx-970 and the instructions only worked when I replaced the tensorflow install command with: conda install tensorflow-gpu=1.8.0. Hopefully this helps someone!

Posted on 2018-09-15 20:23:29
Lucy Nowacki

Nice tutu, but problem with tensorboard

Posted on 2018-10-31 03:04:49
Donald Kinghorn

sorry you are having trouble with tensorboard ... I know it can be finicky sometimes. I always have to double check that everything is setup for it to start up correctly.

I'm thinking I should refresh this post with the latest releases of everything and see if any new "gotcha's" have slipped in. That does happen over time. I've been mostly using PyTorch recently and using TF in docker containers. I've been thinking about doing some screen-casts, redoing this post would probably be a good one!

Posted on 2018-10-31 20:08:15
Lucy Nowacki

sorted out. Simply, it's good to use tensorboard --logdir[dir with session] --host=127.0.1 or go to file and call tensorboard --logdir ./ --host=127.0.1 , instead.

Posted on 2018-11-19 00:02:48
Adrian

I've tried several times and I keep encountering

ImportError: /home/.../anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: undefined symbol: cuDevicePrimaryCtxGetState
What could I be getting wrong?

Posted on 2018-11-03 14:47:11
Donald Kinghorn

I'm not sure but I have a couple of ideas ... First, check to see that everything is OK with the driver: do nvidia-smi to see that the driver is running correctly and seeing your card. Next, check the links for the tf .so file: cd into that directory and do ldd libtensorflow_framework.so to check for libraries that are missing or the wrong version.
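
The ldd output can be long, so here is a rough sketch of picking out just the unresolved libraries. The function name is made up, and the sample text is invented for illustration (it is not output from a real system); it relies only on ldd's usual "=> not found" marker.

```python
def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


# Invented sample in the shape ldd prints (not from a real machine).
sample = (
    "\tlibcuda.so.1 => not found\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3c2a000000)\n"
)
print(missing_libraries(sample))  # ['libcuda.so.1']
```

You could feed it real output captured with `subprocess.check_output(["ldd", "libtensorflow_framework.so"], universal_newlines=True)`. A missing libcuda.so.1 would point at the driver install rather than at conda.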

That should get you some hints for what's happening. My guess is that the driver is not right somehow??? I think cuDevice... is a function in the driver's cuda runtime lib.

... I just checked the graphics-drivers ppa they have the new 410.73 driver in there now. That's needed for the new RTX cards and for CUDA 10 ... I just checked something else :-) The machine I'm on right now has the original install as described in this post with TF 1.8 I am running driver 396.54, I activated the tf-gpu env and did conda update --all which updated TF to the latest 1.11 version. It is using cuda 9.2 ... this started up and ran fine on this system. So, you don't necessarily need to update your driver beyond 396 (yet)

Posted on 2018-11-04 19:39:06
A B

I am on an optimus laptop with nvidia 1050ti and intel uhd 630. I installed the nvidia driver 410. Then followed your guide. It works, but nvidia-smi still shows "No running processes found" during the execution of the code, and 'top' in terminal shows around 750% cpu usage (8 hyperthreads I believe, on the core i5 8300H). Also, this took around 15x120seconds for completion. So, it does not seem like it is using the nvidia gpu at all. No errors either!

Can you suggest what might have gone wrong? I would like to get this running on the gpu to speed things up. I did not manually install cuda toolkit. Everything other than the nvidia driver has been installed as mentioned in your guide.

Update: I spoke too soon! All I needed to do was run "sudo prime-select nvidia" and reboot to get my gpu used by tensorflow.
Thanks a lot for the writeup. This was the easiest to follow.

Posted on 2018-11-05 16:49:39
Donald Kinghorn

OH good! Laptops can sometimes be a problem. Thanks for posting your question AND solution!

I have a gaming laptop with a 1070 that only uses the NV GPU so I don't even have a way to check problems that come up from optimus. Best wishes --Don

Posted on 2018-11-05 17:53:11
Serge Chugunin

After successful installation I tried about half of examples from
https://github.com/aymericd...

Everything works!

Except for two examples with convolutional neural networks:
- Convolutional Neural Network (notebook) (code). Build a convolutional neural network to classify MNIST digits dataset. Raw TensorFlow implementation.
- Convolutional Neural Network (tf.layers/estimator api) (notebook) (code). Use TensorFlow 'layers' and 'estimator' API to build a convolutional neural network to classify MNIST digits dataset.

Segmentation fault

(tf-gpu) alex@alex-comp:~/PycharmProjects/mnist_test$ python convolutional_network.py
WARNING:tensorflow:From convolutional_network.py:17: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpv36tw2hf
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/estimator/inputs/queues/feeding_queue_runner.py:62: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/estimator/inputs/queues/feeding_functions.py:500: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
2018-11-12 12:12:41.040365: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-11-12 12:12:41.216695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 12:12:41.217246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.26GiB
2018-11-12 12:12:41.217264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-12 12:12:41.510998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-12 12:12:41.511036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-12 12:12:41.511054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-12 12:12:41.511288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6987 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Segmentation fault (core dumped)

Posted on 2018-11-12 08:17:34
Donald Kinghorn

Hi Serge, glad to hear you are up and running ... that Seg fault is a little disturbing. I'm not sure what could have caused it but I have had random trouble with what appear to be memory leaks in both cpu and gpu space when using Jupyter notebooks. Sometimes they are obvious by looking at memory usage. I generally just shut down the notebook, kill the kernel and start again. I'm used to this sort of thing from long years of dealing with "research code" and new hardware so I don't get too concerned. However, if you are doing "serious" work and running into trouble then look deeper.

One thing to check is the user system resource limits ... let me check on this system ... this is Ubuntu 18.04 in my user account with only the default settings.

The two most important are stack size and the locked-memory (memlock) limit
kinghorn@i9:~$ ulimit -l
16384
kinghorn@i9:~$ ulimit -s
8192

Those are defaults and they are pretty small! In particular too small of a stack size can give segmentation faults for some programs. In my docker containers running on my main machine (it has 128GB memory) I set memlock to "unlimited" and stack size to "67108864"
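
The same limits can be read from inside Python, which is handy when the code runs in a notebook rather than a shell. The sketch below uses only the standard library resource module; the function names are illustrative. Note that getrlimit reports bytes, while the shell's ulimit -s and ulimit -l report kilobytes.

```python
import resource


def show_limit(value):
    """Render an rlimit value; RLIM_INFINITY means "unlimited"."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return (soft, hard) limits in bytes for stack size and locked memory."""
    return {
        "stack": resource.getrlimit(resource.RLIMIT_STACK),
        "memlock": resource.getrlimit(resource.RLIMIT_MEMLOCK),
    }


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={show_limit(soft)} hard={show_limit(hard)}")
```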

I have a discussion about this in reference to docker configuration in
https://www.pugetsystems.co...

I may need to write up a short post on this (how to change those settings). Since systemd is controlling things like this now, I'd have to look some stuff up and do some testing.

Posted on 2018-11-12 17:24:11
Serge Chugunin

Thank you, Donald
I'll try this.

Posted on 2018-11-13 17:45:46
Kapil Khanna

Hi Donald,

Great post. I am using Tensorflow-GPU and PyTorch-GPU successfully for couple weeks now. Thanks.

I have a MSI laptop with Ubuntu 18.04 and configured with Intel Graphics prior to this setup.
The Intel graphics worked better for display. e.g. Hibernate had no issues.
I googled for setting up the display with Intel graphics, while reserving Nvidia for Tensorflow-GPU with CUDA.
Haven't been able to get this working.

DETAILS
*********************
kapilok@trantor-lp98:~$ sudo lshw -C display
[sudo] password for kapilok:
*-display
description: 3D controller
product: GP107M [GeForce GTX 1050 Ti Mobile]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:147 memory:a3000000-a3ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:4000(size=128) memory:a4000000-a407ffff
*-display
description: VGA compatible controller
product: Intel Corporation
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: irq:125 memory:a2000000-a2ffffff memory:80000000-8fffffff ioport:5000(size=64) memory:c0000-dffff
kapilok@trantor-lp98:~$

Please advise.
Thanks,
Kapil

Posted on 2018-12-05 06:11:23
Donald Kinghorn

Laptops can be trouble. I have a gaming laptop with a 1070 in it but it has the display connected directly to it, i.e. no Intel graphics ... I'm sure it is possible to use your 1050ti while the display is on Intel but your system might fight with you a bit. ???

The first thing to try is to treat it like it was "headless". Just load the nvidia drivers and see if it works (I'm assuming that the /dev node files are there since the hardware is properly detected)

sudo modprobe nvidia
sudo modprobe nvidia_uvm

That should work ??? (I hope) Do an ls /dev to see if the stuff like /dev/nvidia0 and /dev/nvidia-uvm are there. With the kernel modules loaded, the tools that probe for the card should see it. Try nvidia-smi. If that works then you are in business for compute, since TensorFlow and PyTorch will probe for cuda devices when they start up.
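
Those two checks can be scripted if you want to run them repeatedly. The following is a rough sketch using only the standard library; the function names are hypothetical, and it assumes nvidia-smi accepts the --query-gpu=driver_version and --format=csv,noheader options, which recent drivers do.

```python
import glob
import shutil
import subprocess


def nvidia_device_nodes(dev_dir="/dev"):
    """List device files such as /dev/nvidia0, if any exist."""
    return sorted(glob.glob(dev_dir + "/nvidia*"))


def driver_versions():
    """Ask nvidia-smi for the driver version; None if the tool is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print("device nodes:", nvidia_device_nodes() or "none found")
    print("driver version(s):", driver_versions() or "nvidia-smi unavailable or failing")
```

If the device nodes are missing or nvidia-smi fails, the kernel modules are probably not loaded, and TensorFlow will not find the card either.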

Posted on 2018-12-05 16:30:43
Kapil Khanna

Thanks, Donald.
This is working for now. Laptop under test for a few days. Keep you posted.

Posted on 2018-12-09 00:52:34
Cristiano Coelho Souza

It's important to set the Python version in the command to create the conda env, because tensorflow currently only supports Python 3.6.
--> conda create --name tf-gpu python=3.6

Posted on 2018-12-06 14:43:55
Donald Kinghorn

Thank you for adding that comment! This is a problem with posts like this. Things change! I'm going to try to do a refresh on some of these "core" posts in the new year and I'll try to add warnings in places where versioning can be an issue. ... having simple fine control over that is a nice feature of using conda

Posted on 2018-12-06 17:24:27
Mileta Dulovic

Okay so. I have big problem and I wasn't able to solve it for days.
I've searched like every thread on this and still can't do it.

Firstly. I am using Ubuntu 18.04 with python 3.6 and Nvidia Gforce
920M graphic card. I have Cuda 9.0 (also tried with 9.1 and 9.2).

I was trying to install tensorflow-gpu but every time I import it I get same error.

2018-12-09 01:43:41.324778: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)

I also tried with Anaconda and still same error. Any tips?

Posted on 2018-12-09 09:34:45
Mileta Dulovic

So.. switching to python 2.7 resolves this problem.. now it works great.. I really have no idea why but it works..

Posted on 2018-12-09 11:17:30
Donald Kinghorn

It's interesting that you were getting a core dump! I'm not sure what's going on there. That message about AVX is normal. TF starts with a hardware capability probe and then warns if the exe wasn't compiled to take advantage of it. I did a custom build with AVX support and it really didn't make much difference.
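
To see which instruction sets the CPU itself reports, the flags line in /proc/cpuinfo can be inspected. This is a minimal Linux-only sketch; cpu_flags is an illustrative helper, not part of TensorFlow.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            flags = cpu_flags(fh.read())
    except OSError:
        flags = set()  # not Linux, or /proc is unavailable
    for name in ("sse4_1", "sse4_2", "avx", "avx2"):
        print(name, "yes" if name in flags else "no")
```

If avx shows "no", a TensorFlow binary that was built to require AVX will abort at import on that machine.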

I'm puzzled about the python 2 vs 3 thing. I would try it again and be sure that you specify which python version during env creation ... on the other hand working with 2.7 is fine too if you are OK with that!

Posted on 2018-12-10 17:18:38
HotNews8

Thank you

Posted on 2018-12-12 18:08:05
jefflee

Hi - Do you have any idea of why this occurred? I installed cuda 10, anaconda (nvidia driver 410), and then tensorflow-gpu. But following the instruction of this article I got this error:

Train on 60000 samples, validate on 10000 samples
2018-12-12 16:03:31.294478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-12 16:03:31.294501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-12 16:03:31.294505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-12 16:03:31.294508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-12 16:03:31.294574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9508 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/15
Segmentation fault (core dumped)

Maybe the cuda version problem?

Posted on 2018-12-12 21:09:05
jefflee

alright, i figured out it was probably the incompatibility of different versions of cudnn. I created a new environment with conda specifying python=3.6 (conda create --name tf-gpu python=3.6), and then installed tensorflow-gpu=1.8.0 (conda install tensorflow-gpu=1.8.0). I'm still wondering why exactly this happened but at least now all codes on this page run smoothly.

Posted on 2018-12-12 21:48:01
Donald Kinghorn

Good, I'm glad you sorted it out. Stuff like what you saw is not uncommon. Version conflicts can be a headache. Python would be almost unusable without env's! This is a problem writing up posts like this too. Even if everything is working great when I do it initially a few months down the road things may break from version changes. I plan on a refresh of these setup posts and I'm going to try to add warnings about stuff like this

Posted on 2018-12-12 23:28:05
Bhuvanesh S K

Sir I can't believe this but it worked... I wasted time and data downloading Cuda Toolkit and stuff... This is really awesome. Thqs a lot sir

Posted on 2018-12-15 19:25:13
g_h0st

Hi Donald,

I'm running into an issue with this MNIST example. I created a stackoverflow post regarding it--do you mind looking at it?

https://stackoverflow.com/q...

Thanks

Posted on 2018-12-22 03:26:06
Donald Kinghorn

Saw your comment on stackoverflow ... glad you got it sorted out ... I really need to get that refresh of this post put up. That should happen in a few weeks

Posted on 2018-12-24 18:04:20
Filip Novoselnik

Hello, I have some troubles installing tensorflow.
When I try your example I get this:

ImportError Traceback (most recent call last)
~/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>
57
---> 58 from tensorflow.python.pywrap_tensorflow_internal import *
59 from tensorflow.python.pywrap_tensorflow_internal import __version__

Do you have any ideas why this happens?
Thank you!

Posted on 2019-01-11 11:39:40
Donald Kinghorn

from what you posted I can't really tell but I have some suspicions ... There have been problems coming up from version mismatch (a big reason why I need to refresh this post)

Try making a new environment with a specific version of python and then install into that with a specific version of TF

conda create --name tf-gpu-112 python=3.6
source activate tf-gpu-112
conda install tensorflow-gpu=1.12

Otherwise everything else should be the same. This is a good place to start at least. You can create or delete envs and try some different versions but I think that combination with Python 3.6 and TF 1.12 is good. It should pull everything you need. You could possibly be having an issue with your NVIDIA driver too. ... hang on a sec ... tensorflow-gpu=1.12 will pull cuda 9.2 and cudnn 7.2. Cuda 9.2 needs a 396 series (or newer) driver; the 385 and 390 drivers are only new enough for cuda 9.0 and 9.1. If you have an older driver than that then you may need to update it.

Another thing, be sure you set up your Jupyter notebook correctly too and that you have selected the right env. If the notebook is not loading the right env then you'll get version/missing errors when you try to load modules.
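
One way to confirm which env a notebook kernel is really using is to print the interpreter details from a cell. The sketch below is standard library only; env_report is a made-up name. CONDA_DEFAULT_ENV is set by conda on activation, but for a Jupyter kernel it can reflect the shell that launched Jupyter, so sys.prefix is the more reliable indicator.

```python
import os
import sys


def env_report():
    """Summarize which Python interpreter and conda env this process is using."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "python": "%d.%d.%d" % sys.version_info[:3],
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in env_report().items():
        print(f"{key}: {value}")
```

If the prefix does not end in the env you expect (for example envs/tf-gpu-112), the notebook is running a different kernel than the one you installed TensorFlow into.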

Posted on 2019-01-11 18:53:31
BadgerMan

You rock!

I am a relative newbie - a few weekends of banging my head against the wall and shazam something that works. Thank you for your help. It was tough enough to get through the NVIDIA driver installation for Ubuntu (See NVIDIA support site) only to have the Jupyter notebook session not to recognize my GPU with TensorFlow . My system recognized the NVIDIA card (G-Force 970) however Tensor flow would not recognize the GPU - I believe I followed the TensorFlow documentation to the letter. Your instructions were great, I did skip to the part where you upgraded anaconda/python etc since this was a fresh install from anaconda.

I can now focus on the learning instead of the system set up. FYI my 15th Epoch registered 95 us ( 8 core i7 / NVIDIA 970 / Ubuntu 18) compared to the 83 us your system registered with the 1080 Ti. Need to digest the differences on why your method works so well -- but really thanks again!

Posted on 2019-01-15 03:07:57
Ivan Z

Hello,
Great guide - I will make sure to try it out next week when my workstation arrives and i switch to Ubuntu.
One thing that makes me a little worried is that it will give me the same troubles as with a Win10 install...
problem is that Anaconda comes now with 2.7 or 3.7 version of Python .... Keras supports only until 3.6.
in the last couple of days I struggled to make the Keras install without breaking Anaconda.
The steps I follow are these:
- Install Anaconda with python 3.7
- update all packages
- downgrade to Python 3.6
- install Keras

It worked on one Win10 PC but still doesn't on the second PC .... Anaconda Prompt shuts down without even giving me time to Ctrl C and the navigator refuses to launch anything.

Could anybody point me to a good uptodate guide or steps to follow to install Keras and Python 3.6 using the available Anaconda downloads (Python 2.7 or 3.7)?
Help would be very much appreciated.

Ivan

Posted on 2019-02-14 11:26:14
Donald Kinghorn

Try creating a new env like this,

conda create --name tf-gpu python=3.6

That will create that particular environment with python 3.6. Actually, it might be good to make it more specific since the newest python 3.7.2 and 3.6.8 seem to be having trouble (broken for me with PyTorch; I had to create an env with python 3.7.1 to get version 1.0.1 pytorch to work)

conda create --name tf-gpu python=3.6.7

Then do the rest like in the post. One of the great strengths with envs is you can be specific about what versions of stuff you install into them. You can create and delete them as your needs change

Posted on 2019-02-15 00:36:08
Chetak Kandaswamy

Useful installation instructions for GPU's. I have installed 4 GPU's and run the code parallel with compilation of model on CPU. The system crashes, when running on the Ubuntu 18.04 desktop edition, but runs well in Windows 10. I made a recent post in Stackoverflow: https://stackoverflow.com/q....

Where is the problem? Also, is there a way to stop memory cache overload when running the model training for more than 2 days?

Posted on 2019-02-19 15:17:40
Donald Kinghorn

I'm not really sure what is going on but you might want to check "ulimit" stuff. I talk about this a bit in one of my docker and NGC config posts
https://www.pugetsystems.co...

That might help

Another thing, given the first comment on stackoverflow and the fact that you have 4 GPU's ... look at your BIOS settings and see if you have something for turning on large BAR's, i.e. "Above 4G decoding". I kind of doubt that is what is causing your trouble since Win10 works OK but you never know

It sounds like you are having some kind of resource issue it could be tricky to sort out!

Posted on 2019-02-20 21:56:11
Chuck37

I never got "jupyter notebook" to work as explained in the tutorial. It complained that there was no such file or directory "notebook". I ran "sudo apt install jupyter-notebook" outside the virtual environment and used "jupyter-notebook" instead and the rest of the tutorial worked properly.

Posted on 2019-03-10 18:16:19
Donald Kinghorn

Interesting, it sounds to me like you have something else on your PATH called jupyter that was getting hit before the Anaconda installed jupyter and it was expecting a file argument. You might want to check your user environment variables ...

Check your .bashrc file to see if the anaconda PATH is set and is correct.
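
A quick way to see every jupyter on your PATH, in the order the shell would find them, is a small lookup like the one below. It is a standard-library sketch and find_all_on_path is just an illustrative name. The first hit is the one that runs when you type the command.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for hit in find_all_on_path("jupyter"):
        print(hit)
```

If the first entry is not under your anaconda3 directory, something else is shadowing the Anaconda jupyter.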

If things are working OK for you and you are getting stuff done then just ignore this and get back to it :-)

Posted on 2019-03-12 00:35:14
Ian Connor

After following way too many other guides, this is the first one that works! Thanks.

Posted on 2019-03-13 03:48:20
Kapil Khanna

Donald,

I have a new MSI laptop (PS 63 Modern 8RC - 2019 model).
I've installed Ubuntu 18.04.2 LTS.
Nvidia says Ubuntu 18.04 requires CUDA 10.
When I do -> conda install tensorflow-gpu, it sets up tensorflow 1.12 & CUDA 9 and I get failures when I run tensorflow code (different runtime version).
How can I get tensorflow set up with CUDA 10 (preferably using conda)?

Thanks,
Kapil

Posted on 2019-03-15 18:02:03
Donald Kinghorn

Hmm, I'm not sure what they are referring to. TensorFlow itself is built against cuda 9 by default. You would have to compile TF from source to link it to 10 ... Oh never mind! I know what they are referring to. They didn't add 18.04 as a supported distribution until cuda 10 was released. But that doesn't matter because you don't need to install cuda at all.

If you use the tensorflow-gpu from Anaconda cloud then it will install (in the package directory) the needed libraries (cuda 9). You do need a recent NVIDIA driver. The "graphics drivers ppa" will have the latest.

If you are getting runtime errors when you start up TF then something else is going on.

I'm at NVIDIA's GTC this week, when I get back I plan on getting back to writing more "How-To" kinds of post and it looks like I need to revisit this post ... you are not the only person having trouble and I'm not sure why. It should be simple and "just work" if you use Anaconda and conda to set up your python environments. ???
If you post an error message that you are seeing maybe myself or someone else will have some ideas of what's not working for you.

Posted on 2019-03-18 01:30:27
Donald Kinghorn

Hey Kapil, I posted a comment a minute ago and then realized what your setup actually is! You can have trouble on laptops because of the mix of NVIDIA and Intel drivers. I was running Ubuntu 18.04 on an ASUS gaming laptop with a 1070 in it. I set it up so that it only used the NVIDIA driver. Battery life was bad but it worked great. (It was too heavy to be considered "mobile" anyway.) Be sure that you have the driver installed and that it is starting up correctly "on demand". I think it will work automatically, switching from Intel to NV if you start up a program that is linked with cuda. (Remember that the cuda runtime is from the driver, not a cuda install.)

Try opening a terminal and doing nvidia-smi. If you have the driver installed properly and it's running then it should show you your GPU and list the driver. If it does not then that is where your problem is. Unfortunately I can't really help you too much with this. But that would be what you would need to sort out.

If you do have the driver installed and nvidia-smi is giving an error about no GPUs available then you can try loading the driver module manually.

sudo modprobe nvidia (I can't verify that right now) As long as the kernel module is loaded then TF should see the runtime and fire up. Even if you are running your display on the Intel driver!

I hope this helps you!

Posted on 2019-03-18 01:56:29
Kapil Khanna

Hi Donald,

I've recently purchased a new laptop - MSI PS63 Modern 8RC (2019 model). with Nvidia GTX 1050.
Installed Ubuntu 18.04.2 LTS ... all working good.
Installed Nvidia driver - sudo apt install nvidia-driver-410 (CUDA 10) ... all well
Installed Pytorch -> conda install pytorch torchvision cudatoolkit=10.0 ... all well
Tried installing Tensorflow -> conda install tensorflow-gpu ... fails with Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
This installs Tensorflow 1.12 with CUDA 9. Cant figure how to install Tensorflow with GPU and CUDA 10

Please advise.
Thanks,
Kapil

Posted on 2019-03-19 02:43:32
Hafiz Hamza Hafeez

Dated: April 1, 2019
GPU: Nvidia GeForce MX-150
Laptop: Lenovo Ideapad-520
OS: Ubuntu 18.04

I am guessing that Anaconda has updated their default Tensorflow-GPU version to 1.12.0 which really messed things up for me as it did not add Cuda Toolkit to the path that installation expects it to be, hence CUDA can not be found, or the one installed (CUDA 9) is insufficient for the latest Tensorflow leading up to following Error
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Had almost a whole day wasted in trying to resolve this.

Solution:
Use the following command

conda install -c anaconda cudatoolkit

This installs Tensorflow 1.10.0 along with CUDA 10 which works perfectly with everything. Thought this might help someone.
Edit: (This solution will only work if you already have Tensorflow 1.12.0 installed , hence downgrading the Tensorflow Version. Else the last command only installs CudaToolKit)

Posted on 2019-04-01 05:32:41
Donald Kinghorn

The message "cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version" means that your display driver is out of date (or not loaded) for the version of cuda you are trying to use.

The command you listed "conda install -c anaconda cudatoolkit" only installs cudatoolkit in the environment you are currently in, using from the "channel (-c)", "anaconda" . It doesn't install tensorflow.

You needed to update your NVIDIA display driver (or make sure it was loaded ... maybe you are using a laptop and it wasn't loaded...) Then create a new env for tensorflow and install it. All dependencies are installed with it except the "cuda runtime" which comes from the display driver installed on your system.
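
The "driver version is insufficient" check boils down to comparing the installed driver version against the minimum that the cuda runtime needs. Here is a small sketch of that comparison. The function names are illustrative, and the 410.48 minimum for cuda 10.0 used in the example is my recollection of NVIDIA's release notes, so treat it as an assumption and verify it there.

```python
def parse_version(text):
    """Turn "396.54" into (396, 54) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_sufficient(installed, minimum):
    """True when the installed driver is at least the required minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Assumed minimum for cuda 10.0; check NVIDIA's release notes before relying on it.
    minimum_for_cuda_10_0 = "410.48"
    for installed in ("396.54", "410.73"):
        ok = driver_is_sufficient(installed, minimum_for_cuda_10_0)
        print(installed, "sufficient" if ok else "insufficient")
```

Comparing as tuples of integers avoids the trap of string comparison, where "396.9" would sort after "396.54" for the wrong reason.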

TensorFlow is at 1.12 for Linux right now. https://anaconda.org/anacon... It is compiled against cuda 9.2 and cudnn 7.3 ... you shouldn't have needed to do what you did??? Let me check something ... OK, I just verified what is installing along with the latest tensorflow-gpu from the default channel.

Following is a fresh install into a new env

kinghorn@i9:~$ conda create --name tf-12-test
kinghorn@i9:~$ source activate tf-12-test
(tf-12-test) kinghorn@i9:~$ conda install tensorflow-gpu keras-gpu

This pulls modules including cuda 9.2 and cudnn 7.3 along with tensorflow ... so all the dependencies are there and it seems to work fine.


(tf-12-test) kinghorn@i9:~$ conda list --name tf-12-test
# packages in environment at /home/kinghorn/anaconda3/envs/tf-12-test:
#
# Name Version Build Channel
...
cudatoolkit 9.2 0
cudnn 7.3.1 cuda9.2_0
cupti 9.2.148 0
...
...
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libprotobuf 3.6.1 hd408876_0
libstdcxx-ng 8.2.0 hdf63c60_1
markdown 3.0.1 py36_0
mkl 2019.3 199
mkl_fft 1.0.10 py36ha843d7b_0
mkl_random 1.0.2 py36hd81dba3_0
...
numpy 1.16.2 py36h7e9f1db_0
numpy-base 1.16.2 py36hde5b4d6_0
...
python 3.6.8 h0371630_0
...
tensorboard 1.12.2 py36he6710b0_0
tensorflow 1.12.0 gpu_py36he74679b_0
tensorflow-base 1.12.0 gpu_py36had579c0_0
tensorflow-gpu 1.12.0 h0d30ee6_0
...

>>> import tensorflow as tf
>>> tf.__version__
'1.12.0'
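
If you want to check those versions without reading through the whole listing, the conda list output is easy to parse. This is only a sketch: parse_conda_list is a hypothetical helper, and the sample text is a trimmed copy of the listing above rather than live output.

```python
def parse_conda_list(text):
    """Map package name -> version from `conda list` output, skipping comments."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        packages[fields[0]] = fields[1]
    return packages


SAMPLE = """\
# Name Version Build Channel
cudatoolkit 9.2 0
cudnn 7.3.1 cuda9.2_0
tensorflow-gpu 1.12.0 h0d30ee6_0
"""

if __name__ == "__main__":
    pkgs = parse_conda_list(SAMPLE)
    for name in ("cudatoolkit", "cudnn", "tensorflow-gpu"):
        print(name, pkgs.get(name, "not installed"))
```

To use it on a real env, feed it the text from conda list --name followed by your env name instead of SAMPLE.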

Posted on 2019-04-01 16:13:50
Donald Kinghorn

Hey Hafiz, I just reread your comment ... sorry, I missed your config (it was too early in the morning :-) I think the trouble you were having was laptop related. They can cause real headaches sometimes, especially with Linux. That is a pretty old GPU too ... I recommend that you do a fresh display driver install using ppa:graphics-drivers/ppa and use nvidia-driver-418. Then see if you can force your system to switch to nvidia before you activate a python tensorflow environment. You can always quickly create a new env to try things like I did in the comment below. That really should work for you as long as your display driver is up to date and loaded.

On the other hand if whatever you did got you going and you are writing and running code then don't mess with it! :-) Sometimes you just have to be grateful that anything works.
Take care my friend --Don

Posted on 2019-04-01 16:57:32
Hafiz Hamza Hafeez

Thanks for the response Donald.
And yes, I ran the same command to install CudaToolKit in a Cuda environment on a server PC in the lab (GTX 970 - Ubuntu 16.04) and it did not install Tensorflow by default. I will edit the original comment so it does not misguide anyone. However, the command does downgrade Tensorflow to 1.10.0 if it is already installed in the environment, and that resolves the initial problem.

About the Nvidia driver, I lost a whole important day getting it to work correctly on my laptop, and given some Final Year Project deadlines, it's very important for me that I stick with the other part of your advice since it's all working well and I am more than grateful for it :-)
I will try the newer drivers as soon as I have some time to breathe.
Thanks again and regards
Hamza

Posted on 2019-04-02 11:04:53
Mike Chen

Thanks for your installation guideline. In fact, I could not install it in Ubuntu Desktop 18.04 LTS since Any Nvidia R2060 Graphics Driver does not support gcc 7.4. However, I can install it in Ubuntu Desktop 16.04 LTS. After successful installation in Ubuntu 16.04, I have two issues as follows. Please help me solve the issue. Thanks in advance.

---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
<ipython-input-2-98af67ad6a93> in <module>
33
34 model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
---> 35 validation_data=(X_test,y_test), callbacks=[tensor_board])

~/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040
1041 def evaluate(self, x=None, y=None,

.................

~/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1437 ret = tf_session.TF_SessionRunCallable(
1438 self._session._session, self._handle, args, status,
-> 1439 run_metadata_ptr)
1440 if run_metadata:
1441 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
526 None, None,
527 compat.as_text(c_api.TF_Message(self.status.status)),
--> 528 c_api.TF_GetCode(self.status.status))
529 # Delete the underlying status object from memory otherwise it stays alive
530 # as there is a reference to status from this from the traceback due to

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_3/convolution}}]]
[[{{node metrics/acc/Mean}}]]

Please help me solve the issue.

Thanks in advance.

Mike

Posted on 2019-05-20 01:53:23
Mike Chen

Dear Donald:

Thanks a lot for your instructions. It seems that I have already installed it on Ubuntu 16.04 (Ubuntu 18.04 has a gcc compatibility problem, so I uninstalled Ubuntu 18.04). However, I get the following error. It failed to get a convolution algorithm because cuDNN failed to initialize. I searched the Nvidia forum, which suggested updating tensorflow-gpu. However, the problem remained after updating tensorflow-gpu in the tf-gpu environment. Please have a look at the following information.

UnknownError: Failed to get convolution algorithm

UnknownError Traceback (most recent call last)
<ipython-input-1-201d0d4524ec> in <module>
33
34 model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
---> 35 validation_data=(X_test,y_test), callbacks=[tensor_board])
36

~/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
1037 initial_epoch=initial_epoch,
1038 steps_per_epoch=steps_per_epoch,
-> 1039 validation_steps=validation_steps)
1040

........

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[{{node metrics/acc/Mean}}]]

Thanks in advance,

Mike

Posted on 2019-05-20 08:07:11
Donald Kinghorn

Hi Mike,

First, I'm a bit concerned about what you said in the comment before this one ... "Any Nvidia R2060 Graphics Driver does not support gcc 7.4" That shouldn't be the case. You may need a driver update. In this post (a year ago) the driver was 396, which is way too old for any of the RTX cards. I recommend that you update your driver before you do anything else. You can stay on Ubuntu 16.04 (but 18.04 should be fine too). The current TensorFlow in Anaconda is 1.13.1 and it's linked with cuda 10.0, which requires driver 410 or newer ... you want the current 430 driver ...

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-driver-430

After the driver update try the following

...
I don't think I've seen that error before ... but I do have a suggestion: try removing your tf-gpu env and then create a new one. (It looks like you may be hitting a version mismatch ??)

conda remove --name tf-gpu --all
conda info --envs (to check that it is gone)
conda create --name tf-gpu

then either
source activate tf-gpu
or if you have the updated ( in my opinion broken) conda
conda activate tf-gpu

then
conda install tensorflow-gpu jupyter ipykernel

That should get you a fresh TF 1.13.1 which includes Keras and tensorboard.
You could create a new Jupyter kernel too but if the name "tf-gpu" is what you used before then it should be OK.

I think this will get you going. I have had a couple of times in the past where I've had a broken env for some reason and just recreated it for a fix rather than try to clean up the mess

Posted on 2019-05-20 16:26:57