Read this article at https://www.pugetsystems.com/guides/1170
Dr Donald Kinghorn (Scientific Computing Advisor)

Install TensorFlow with GPU Support the Easy Way on Ubuntu 18.04 (without installing CUDA)

Written on May 25, 2018 by Dr Donald Kinghorn

TensorFlow is a very important Machine/Deep Learning framework and Ubuntu Linux is a great workstation platform for this type of work. If you want to set up a workstation using Ubuntu 18.04 with CUDA GPU acceleration support for TensorFlow, then this guide will hopefully help you get your machine learning environment up and running without a lot of trouble. And, you don't have to do a CUDA install!

This guide is for Ubuntu 18.04 but I will also be doing a similar post using the latest Windows 10 build.

Ubuntu 18.04 is out and in my opinion it is a big improvement over 16.04. 18.04 is the latest LTS (Long Term Support) build of Ubuntu. It will become the standard base platform for a lot of projects. There is usually some lag time before packages and projects move to a new base platform like Ubuntu 18.04; however, at this point nearly all of the projects that I care about are already supported on 18.04.

I said "nearly all" ...! Right now I have Ubuntu 18.04 running supported version of Docker, NVIDIA-docker v2, Virtualbox, Anaconda Python, etc, there is only one package that I generally install that is not (officially) supported on 18.04 yet. That one package is NVIDIA CUDA. I had waited to write anything about Ubuntu 18.04 until CUDA 9.2 was released because I was sure it would have install support for 18.04. Well, guess what, it doesn't. For Ubuntu the recent 9.2 CUDA release only has installer support for 16.04 and 17.10! I was really surprised to see that. 16.04 makes sense but 17.10 is a short term intermediate release and it is similar enough to 18.04 that I don't understand why 18.04 didn't happen. There may be something broken that they just decided to wait to fix rather than delay the 9.2 release any further. That would be understandable and reasonable.

I will do a detailed post on how to do an Ubuntu 18.04 install including an unofficial CUDA 9.2 install. In this post I am assuming you have successfully installed Ubuntu 18.04. If that is not the case then you may want to wait for my detailed install post.

If you are not doing CUDA development work then you may not need to install CUDA anyway. The focus here is to get a good GPU accelerated TensorFlow work environment up and running without a lot of fuss.

I'm adding a note here about some issues that have come up in the comments. If you see the following error when you try to run a TF or Keras job, it's because your NVIDIA display driver is not new enough for the new TF and Keras builds on the Anaconda cloud, i.e. they are linking against CUDA libs that need the nvidia-396 runtime and you have the nvidia-390 runtime installed.
Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
You should update your NVIDIA driver using either the "Software & Updates" GUI tool under the "Additional Drivers" tab, or update the driver manually from the command line like,
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-396
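After the driver installs, reboot and then confirm the new driver is active; nvidia-smi should report a 396-series driver version,

sudo reboot
nvidia-smi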
Best wishes --Don


Python environment setup with Anaconda Python

I highly recommend you use Anaconda Python. If you need some arguments for using Python, take a look at my post Should You Learn to Program with Python. For arguments on why you should use the Anaconda Python distribution, see How to Install Anaconda Python and First Steps for Linux and Windows.

Anaconda is focused toward data-science and machine learning. It installs cleanly on your system in a single directory so it doesn't make a mess in your system's application and library directories. It is also performance optimized and links important numerical packages like numpy to Intel's MKL. Most importantly for this post, it includes easily installed modules for TensorFlow that include the CUDA dependencies!

Install Anaconda Python
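First, grab the Anaconda installer for Linux. The version current when this was written was Anaconda3-5.1.0, which is the file used in the commands below; one way to fetch it is with wget from the Anaconda archive (check the Anaconda download page if that URL has changed),

wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh

Then verify the checksum and run the installer,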

sha256sum Anaconda3-5.1.0-Linux-x86_64.sh
bash Anaconda3-5.1.0-Linux-x86_64.sh
  • You will be asked to accept a license agreement and then questioned about the install location. By default it will install at the top of your home directory under anaconda3. I recommend that you use that. [ If you ever want to get rid of it or reinstall you can just remove that directory.]
  • Next it will ask if you want to append the Anaconda executable directory to your PATH environment variable in .bashrc. I recommend that you do that, but remember that you did. It will add something like the following at the end of your .bashrc file,
# added by Anaconda3 installer
export PATH="/home/dbk/anaconda3/bin:$PATH"
  • Then "re-source" your .bashrc file to execute that export. [ It will happen automatically on subsequent login. ]
source ~/.bashrc
  • Next you will be asked if you want to install Microsoft VSCode. VSCode is a really good editor and it is available for free on Windows, Linux and MacOS. However, rather than installing it here, I would recommend that you go to the VSCode website and check it out; if you think you want to try it, then download and install it yourself. I usually use the Atom editor, which also runs on Windows, Linux and MacOS. If you are checking out editors I recommend you try both of these as well as Sublime Text. They are all great editors!
  • Check your install. If you have sourced your .bashrc file and your PATH is correct, you should see something like,
python --version

Python 3.6.4 :: Anaconda, Inc.
  • Update your base Anaconda packages. (conda is a powerful package and environment management tool for Anaconda and it's not restricted to use with just Python)
conda update conda
conda update anaconda
conda update python
conda update --all

That should bring your entire base Anaconda install up to the latest packages.

There is a GUI for Anaconda called anaconda-navigator. I personally find it distracting/confusing/annoying and prefer using conda from the command-line. Your taste may differ! ... and my opinion is subject to change if they keep improving it.


Create a Python "virtual environment" for TensorFlow using conda

You should set up an environment for TensorFlow separate from your base Anaconda environment. This keeps your base clean and gives TensorFlow a space for all of its dependencies. It is in general good practice to keep separate environments for projects, especially when they have special package dependencies.

There are many possible options when creating an environment with conda, including adding packages with specific version numbers and specific Python base versions. This is sometimes useful if you want fine control and it also helps with version dependency resolution. Here we will keep it simple and just create a named environment, then activate that environment and install the packages we want inside of it.

From a command line do,

conda create --name tf-gpu

I named the environment 'tf-gpu' but you can use any name you want.
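If you do want that finer control, you can pin versions right in the create command. For example (the version numbers here are just illustrative; they match what ends up installed later in this post), something like,

conda create --name tf-gpu python=3.6 tensorflow-gpu=1.8.0

would create the environment and install a specific TensorFlow build against a specific Python version in one step.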

Now activate the environment, (I'll show my full terminal prompt and output instead of just the commands)

dbk@i9:~$ source activate tf-gpu
(tf-gpu) dbk@i9:~$

You can see that my shell prompt is now preceded by the name of the environment.
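If you ever lose track of your environments, conda can list them, with the active one marked by an asterisk,

(tf-gpu) dbk@i9:~$ conda env list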

Install TensorFlow from the Anaconda Cloud Repositories

The TensorFlow documentation is in general very good but the install documentation does not present a very good way to get a setup working on a workstation.

Do not follow the install documentation from the TensorFlow site! If you do you will have a painful time getting things working and you will have a nearly impossible to maintain install setup.

There is no good reason to do an (old) CUDA install and a pip install when you are using Anaconda Python. There is an up-to-date official Anaconda package for TensorFlow with GPU acceleration that includes all of the needed CUDA dependencies and it is well optimized for performance.

Let's install TensorFlow with GPU acceleration and all of the dependencies.

(tf-gpu) dbk@i9:~$ conda install tensorflow-gpu

That's it! That's all you need to do!

Just running that one short command above gave the following list of packages to be installed. They are installed and isolated in the "tf-gpu" environment we created. There is no nasty mess on your system!

I've cut some of the packages out and just left the "most interesting" ones in this output listing.

The following NEW packages will be INSTALLED:
...
...
    cudatoolkit:       9.0-h13b8566_0         
    cudnn:             7.1.2-cuda9.0_0        
    cupti:             9.0.176-0              
...      
    intel-openmp:      2018.0.0-8             
    mkl:               2018.0.2-1             
    mkl_fft:           1.0.1-py36h3010b51_0   
    mkl_random:        1.0.1-py36h629b387_0   

    libgcc-ng:         7.2.0-hdf63c60_3       
    libgfortran-ng:    7.2.0-hdf63c60_3       
    libprotobuf:       3.5.2-h6f1eeef_0       
    libstdcxx-ng:      7.2.0-hdf63c60_3       
...    
    numpy:             1.14.3-py36hcd700cb_1  
    numpy-base:        1.14.3-py36h9be14a7_1  
...
    protobuf:          3.5.2-py36hf484d3e_0   
    python:            3.6.5-hc3d631a_2       
...     
    tensorboard:       1.8.0-py36hf484d3e_0   
    tensorflow:        1.8.0-hb11d968_0       
    tensorflow-base:   1.8.0-py36hc1a7637_0   
    tensorflow-gpu:    1.8.0-h7b35bdc_0       

You now have GPU accelerated TensorFlow 1.8, CUDA 9.0, cuDNN 7.1, Intel's MKL libraries (that are linked into numpy) and TensorBoard. Nice!
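As a quick sanity check that TensorFlow actually sees the GPU, you can list the devices it finds from a Python prompt in the tf-gpu environment. This uses the device_lib utility that ships with TensorFlow; you should see a "/device:GPU:0" entry for your card,

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())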


Create a Jupyter Notebook Kernel for the TensorFlow Environment

You can work with an editor and the command line, and you often want to do that, but Jupyter notebooks are great for doing machine learning development work. In order to get Jupyter notebook to work the way you want with this new TensorFlow environment, you will need to add a "kernel" for it.

With your tf-gpu environment activated do,

(tf-gpu) dbk@i9:~$ conda install ipykernel

Now create the Jupyter kernel,

(tf-gpu) dbk@i9:~$ python -m ipykernel install --user --name tf-gpu --display-name "TensorFlow-GPU"

With this "tf-gpu" kernel installed, when you open a Jupyter notebook you will now have an option to to start a new notebook with this kernel.
[Image: Jupyter kernel for TF]
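You can also verify that the kernel was registered from the command line with Jupyter's kernelspec listing; it should show a tf-gpu entry alongside the default python3 kernel,

(tf-gpu) dbk@i9:~$ jupyter kernelspec list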


An Example using Keras with TensorFlow Backend

In order to check everything out, let's set up LeNet-5 using Keras (with our TensorFlow backend) in a Jupyter notebook with our "TensorFlow-GPU" kernel. We'll train the model on the MNIST digits data-set.

Install Keras

With the tf-gpu environment activated do,

(tf-gpu) dbk@i9:~$ conda install keras

You now have Keras installed utilizing your GPU accelerated TensorFlow. It is that easy!
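A quick way to confirm that Keras is picking up the TensorFlow backend is simply to import it; it should print the backend it is using,

(tf-gpu) dbk@i9:~$ python -c "import keras"
Using TensorFlow backend.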

Launch a Jupyter Notebook

With the tf-gpu environment activated start Jupyter,

(tf-gpu) dbk@i9:~$ jupyter notebook

From the 'New' drop-down menu select the 'TensorFlow-GPU' kernel that you added (as seen in the image in the last section). You can now start writing code!

MNIST example

Following are Python snippets you can copy into cells in your Jupyter notebook to set up and train LeNet-5 with MNIST digits data.

Import dependencies

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Flatten,  MaxPooling2D, Conv2D
from keras.callbacks import TensorBoard

Load and process the MNIST data

# Load the MNIST train/test split (28x28 grayscale images)
(X_train,y_train), (X_test, y_test) = mnist.load_data()

# Reshape to (samples, height, width, channels) for the Conv2D layers
X_train = X_train.reshape(60000,28,28,1).astype('float32')
X_test = X_test.reshape(10000,28,28,1).astype('float32')

# Scale pixel values to the range [0, 1]
X_train /= 255
X_test /= 255

# One-hot encode the digit labels
n_classes = 10
y_train = keras.utils.to_categorical(y_train, n_classes)
y_test = keras.utils.to_categorical(y_test, n_classes)

Create the LeNet-5 neural network architecture

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3), activation='relu', input_shape=(28,28,1)) )
model.add(Conv2D(64, kernel_size=(3,3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))
model.add(Flatten())          
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(n_classes, activation='softmax'))
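If you want to see the layer output shapes and parameter counts (this model works out to roughly the 1.2 million trainable parameters mentioned in the results below), Keras can print a summary,

model.summary()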

Compile the model

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Set log data to feed to TensorBoard for visual analysis

tensor_board = TensorBoard('./logs/LeNet-MNIST-1')

Train the model

model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
          validation_data=(X_test,y_test), callbacks=[tensor_board])
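If you also want a final score on the held-out test set after training, the standard Keras evaluate call works; it returns the loss and accuracy in the order given by the metrics list used at compile time,

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])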

The results

After running that training for 15 epochs the last epoch gave,

Epoch 15/15
60000/60000 [==============================] - 5s 83us/step - loss: 0.0188 - acc: 0.9939 - val_loss: 0.0303 - val_acc: 0.9917

Not bad! Training accuracy 99.39% and validation accuracy 99.17%.


Look at the job run with TensorBoard

Start TensorBoard

 (tf-gpu) dbk@i9:~$ tensorboard --logdir=./logs --port 6006

It will give you an address similar to http://i9:6006. Open that in your browser and you will be greeted with (the wonderful) TensorBoard. These are the plots it had for that job run,
[Image: TensorBoard output]

That was a model with 1.2 million training parameters and a dataset with 60,000 images. It took 1 minute and 9 seconds utilizing the NVIDIA GeForce 1080Ti in my system!

Happy computing! --dbk

Tags: Ubuntu 18.04, TensorFlow, CUDA, Intel MKL, Anaconda Python
Jay Chen

Great write up!

Posted on 2018-05-29 19:27:58
Chester Parrott

# Might be obvious, but also need to install Nvidia drivers first

sudo apt-get purge nvidia*
sudo add-apt-repository ppa:graphics-drivers
sudo apt-get update
sudo apt-get install nvidia-390
sudo reboot now

Posted on 2018-06-01 23:12:51
koundy

Thanks!

Posted on 2018-06-12 11:40:44
João Devezas

I wouldn't recommend blindly installing that driver; instead it may be easier/wiser to let Ubuntu auto-install the drivers:

ubuntu-drivers devices
ubuntu-drivers autoinstall

reboot

Posted on 2018-12-09 13:59:34
Donald Kinghorn

(never blindly install anything! ... but, the graphics-drivers ppa is maintained by really good people ... I wouldn't recommend "autoinstall")

You need to know what drivers are being installed! If you want to run any code built against CUDA 10 you will need a 410+ driver to support the runtime.

... but I've noticed that the built in device detection in Ubuntu is better now and they support reasonable drivers. I think they still pull from graphics-drivers ppa though ??

I've been able to update and upgrade display drivers using the GUI tool "Software & Updates" under the "Additional Drivers" tab. So this would probably be the preferred way to do this now. You can still pick which driver you want.

Posted on 2018-12-10 17:35:17
John

Thank you so much! It is really a PAIN!!! to follow the installation guide on tensorflow.org. I wish I could find this article earlier. GREAT JOB!

Posted on 2018-06-10 07:34:32
Ji Han Heo

Thank you so much! I've been struggling with installing GPU supported tensorflow with CUDA 9.2, cuDNN 7.1.4 on 16.04 LTS. I poured in hours and days of trying, to no avail... apparently CUDA 9.2 isn't compatible with... I don't know what. I installed bionic beaver and followed your instructions here and voila it works!!! Thank you so much for saving my time!!!!!!

Posted on 2018-06-16 23:28:39
Soumitra

I followed the steps to create the environment. When running the example, I keep on getting the following error:

TypeError: softmax() got an unexpected keyword argument 'axis'.

Also, I would like to know where cuda is installed, since I do not see anything in the following folder:

/usr/local/cuda

Any help / pointer to appropriate information is much appreciated.

Posted on 2018-07-22 20:20:35
Donald Kinghorn

I'm not sure about the softmax error. There might be a missing ) somewhere?? Or maybe an indentation problem?? Not sure...

The CUDA libs are packaged with the tf module. In my install they are in ~/anaconda3/envs/tf-gpu/lib

Posted on 2018-08-01 21:33:58
Binh Truong

Thank you so much for creating this write up. I was following other instructions on how to build Tensorflow from source and got completely lost.

I did have to install twice because I missed Chester's steps. The first time, it ran correctly, but 10 times slower. After installing the Nvidia drivers first, uninstalling Anaconda, and reinstalling everything here, I got a better runtime at ~75s with a 1070Ti.

Posted on 2018-08-15 02:30:27
slow_learner

Great tutorial. I didn't know setting up tensorflow was this easy!

Posted on 2018-08-16 09:32:16
Carch

Thanks, man. Is there a way to launch a separate terminal that can access the already activated virtualenv so that I could use Tensorboard --logdir=./logs --port 6006? Please instruct, thanks.

Posted on 2018-08-21 10:47:10
Donald Kinghorn

... I've been on vacation for a bit ... You should be able to just fire up another terminal and "source activate tf-gpu" to give you another shell in the env. I usually do that ... I like to have at least 2 shells open for an environment. ctrl-shift-T is your friend :-)

Posted on 2018-08-27 14:39:05
Carch

I thought there is another way. Thanks, man. You are a life saver. I tried to install Tensorflow for at least two weeks a couple months ago but failed. I decided to give it another shot a week ago, and I found your post. You have my deepest gratitude!!!

Posted on 2018-08-27 22:14:50
ramrad

Thanks for the excellent blog! I was able to install tf-gpu and keras. However, when I execute the MNIST example in Jupyter, it does not use the GPUs on the system. It takes over 60 sec per epoch and nvidia-smi shows that the job is not executed on the GPUs, which are idle. This is with TF 1.10 and keras 2.2

Posted on 2018-08-27 03:39:39
Donald Kinghorn

I'm not sure what the problem is but be sure to check that you have activated the tf-gpu environment. 60 sec seems like a long time for that. It kind of sounds like you may have installed tensorflow into your environment instead of tensorflow-gpu. Also, with the Keras update it may be best to use keras-gpu ...

I would make a new environment and install tensorflow-gpu and keras-gpu in it and try it again. Hope this works for you!

Posted on 2018-08-27 21:08:36
richard

I had the same problem (xUbuntu 18.04, GTX1050, nvidia driver 396.54)

You should have a look here: https://github.com/Continuu...
Seems that it comes from an incompatibility between "conda-forge" and "defaults" packages

What worked for me:
conda create -n env_tf-gpu -c defaults tensorflow-gpu keras-gpu ipykernel

Hope it will help

Posted on 2018-09-17 09:00:52
Lucy Nowacki

First update your drivers, next install CUDA 9.2 from nvidia, and after that use this tutorial. It flies. It seems that the CUDA from conda is already obsolete. My spec is a laptop MSI GS70 with GTX 970M on drivers 410.73

Posted on 2018-11-19 00:16:24
Renz Abergos

Running on Ubuntu 18.04 with 1070ti.

Installed nvidia driver then followed these instructions but it did not work. Got error running model.fit()

E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Posted on 2018-08-28 09:41:31
Donald Kinghorn

I know that error too well ... This is from a runtime library mismatch. I talk about it a bit in this post https://www.pugetsystems.co... This is an annoyance with CUDA and the NVIDIA driver. It can go both ways, either a too-new or too-old driver, depending on which CUDA version you are using. It's because the CUDA runtime is provided by the driver, not the CUDA toolkit; these need to be in sync when you are doing dev work.

Let me check the newest anaconda tensorflow-gpu ... OK, I see it is using cudatoolkit 9.2. That means that you will need an nvidia-396 driver installed on your system. You should be able to update your driver either from the "Software & Updates" tool in Ubuntu or from the graphics-drivers ppa. You shouldn't have any trouble like what I described in the post that I linked since I think those quirks are fixed now. Once you get your driver updated (or downgraded!) you should be good to go.

Posted on 2018-08-28 15:48:13
Mauro Risonho Paula Assumpção

Great write up!

My system Ubuntu 18.04.1 x64
Nvidia Driver 390.77 / GeForce GT 740M / CUDA Cores:384 / GPU: 2048 MB / PCIe Generation:PCI Express x8 Gen2

model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
validation_data=(X_test,y_test), callbacks=[tensor_board])

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-2-accd94721056> in <module>()
----> 1 model.fit(X_train, y_train, batch_size=128, epochs=15, verbose=1,
2 validation_data=(X_test,y_test), callbacks=[tensor_board])

NameError: name 'model' is not defined

Posted on 2018-08-29 04:18:12
Adhitya Mohan

Have you tried using the latest nvidia driver 396 instead of 390?
the commands below should help you
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-396
and reboot and try to see if it works.

Posted on 2018-09-07 08:42:27
Donald Kinghorn

There's your comment, I didn't see it at first (except in my email notice)... I quoted it below for someone else. Thanks for posting it! I have added a paragraph near the top of the post about the runtime mismatch since the TF and Keras builds are linking to CUDA 9.2 now and that requires the runtime in the nvidia-396 driver. Best wishes --Don

Posted on 2018-09-07 17:38:12
Adhitya Mohan

Yeah I replied to the wrong comment since this a user error, the previous one was about the runtime error.

Posted on 2018-09-07 17:43:46
Mauro Risonho Paula Assumpção

Wed Aug 29 17:14:40 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.77 Driver Version: 390.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 On | N/A |
| N/A 43C P8 8W / N/A | 126MiB / 6069MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|==================================================|
| 0 3646 G /usr/lib/xorg/Xorg 120MiB |
| 0 4324 G /usr/lib/firefox/firefox 3MiB |
+-----------------------------------------------------------------------------+

[I 16:49:54.120 NotebookApp] Saving file at /Projects/Ubuntu18041.ipynb
2018-08-29 16:50:07.374783: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-29 16:50:08.935207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-29 16:50:08.936001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.733
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.69GiB
2018-08-29 16:50:08.936036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-29 16:50:08.936238: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Posted on 2018-08-29 20:15:20
Donald Kinghorn

This is the runtime library mismatch error. You need the nvidia-396 driver for the newer TF and Keras packages.

Adhitya Mohan has a good reply to fix this but I am not seeing his comment for some reason .... I'll just copy it here. Thank you Adhitya :-)

Have you tried using the latest nvidia driver 396 instead of 390?
the commands below should help you
sudo apt purge nvidia*
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-396
and reboot and try to see if it works.

Posted on 2018-09-07 16:10:33
Kris

Thanks for this writeup! Regarding the driver mismatch, is there a way to request an older build of TF and Keras from the Anaconda cloud which are linked against the 390.x driver? I have a GeForce 550 Ti, which is not supported on the 396.x short lived branch. Otherwise, it is looking like I won't be able to use Anaconda?

Posted on 2018-09-08 17:10:37
Donald Kinghorn

Yes, you don't necessarily need to be using the latest version of TF. I believe it's been stable with a solid API since around 1.4 or earlier. You can use conda ...

conda search -f tensorflow-gpu

to list available package versions and then you can install a specific version with something like,

conda install tensorflow-gpu=1.8.0

That should get you the version I used originally in this post and it should work with the 390 driver.

Keep an eye out for good deals on used GPU's! The new GeForce 20xx series should be available soon and there could be a lot of people getting rid of what are really excellent cards. Anything from a GTX 970 and up is really nice to work with. Best wishes --Don

Posted on 2018-09-09 17:04:40
Kris

Thank you for the reply. I installed 1.8.0, and it informed me that the GTX 550 Ti has a CUDA compute capability of 2.1, and that 3.5 or better was required. It was a budget card when I bought it long ago, so I shouldn't be surprised that it is no longer supported. Guess it is time to upgrade.
Thanks again!

Posted on 2018-09-12 02:46:55
Dima Kalika

Thank you!!!

Just a heads up - I have a gtx-970 and the instructions only worked when I replaced the tensorflow install command with: conda install tensorflow-gpu=1.8.0. Hopefully this helps someone!

Posted on 2018-09-15 20:23:29
Lucy Nowacki

Nice tutorial, but problem with tensorboard

Posted on 2018-10-31 03:04:49
Donald Kinghorn

sorry you are having trouble with tensorboard ... I know it can be finicky sometimes. I always have to double check that everything is set up for it to start up correctly.

I'm thinking I should refresh this post with the latest releases of everything and see if any new "gotcha's" have slipped in. That does happen over time. I've been mostly using PyTorch recently and using TF in docker containers. I've been thinking about doing some screen-casts, redoing this post would probably be a good one!

Posted on 2018-10-31 20:08:15
Lucy Nowacki

sorted out. Simply, it's good to use tensorboard --logdir[dir with session] --host=127.0.1 or go to file and call tensorboard --logdir ./ --host=127.0.1 , instead.

Posted on 2018-11-19 00:02:48
Adrian

I've tried several times and I keep encountering

ImportError: /home/.../anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so: undefined symbol: cuDevicePrimaryCtxGetState
What could I be getting wrong?

Posted on 2018-11-03 14:47:11
Donald Kinghorn

I'm not sure but I have a couple of ideas ... first check that everything is OK with the driver: do nvidia-smi to see that the driver is running correctly and seeing your card. Next check the links for the tf .so file: cd into that directory and do ldd libtensorflow_framework.so to check for libraries that are missing or the wrong version.

That should get you some hints for what's happening. My guess is that the driver is not right somehow??? I think cuDevice... is a function in the driver's cuda runtime lib.

... I just checked the graphics-drivers ppa they have the new 410.73 driver in there now. That's needed for the new RTX cards and for CUDA 10 ... I just checked something else :-) The machine I'm on right now has the original install as described in this post with TF 1.8 I am running driver 396.54, I activated the tf-gpu env and did conda update --all which updated TF to the latest 1.11 version. It is using cuda 9.2 ... this started up and ran fine on this system. So, you don't necessarily need to update your driver beyond 396 (yet)

Posted on 2018-11-04 19:39:06
A B

I am on an optimus laptop with nvidia 1050ti and intel uhd 630. I installed the nvidia driver 410. Then followed your guide. It works, but nvidia-smi still shows "No running processes found" during the execution of the code, and 'top' in terminal shows around 750% cpu usage (8 hyperthreads I believe, on the core i5 8300H). Also, this took around 15x120seconds for completion. So, it does not seem like it is using the nvidia gpu at all. No errors either!

Can you suggest what might have gone wrong? I would like to get this running on the gpu to speed things up. I did not manually install cuda toolkit. Everything other than the nvidia driver has been installed as mentioned in your guide.

Update: I spoke too soon! All I needed to do was run "sudo prime-select nvidia" and reboot to get my gpu used by tensorflow.
Thanks a lot for the writeup. This was the easiest to follow.

Posted on 2018-11-05 16:49:39
Donald Kinghorn

OH good! Laptops can sometimes be a problem. Thanks for posting your question AND solution!

I have a gaming laptop with a 1070 that only uses the NV GPU so I don't even have a way to check problems that come up from optimus. Best wishes --Don

Posted on 2018-11-05 17:53:11
Serge Chugunin

After successful installation I tried about half of examples from
https://github.com/aymericd...

Everything works!

Except for two examples with convolutional neural networks:
- Convolutional Neural Network (notebook) (code). Build a convolutional neural network to classify MNIST digits dataset. Raw TensorFlow implementation.
- Convolutional Neural Network (tf.layers/estimator api) (notebook) (code). Use TensorFlow 'layers' and 'estimator' API to build a convolutional neural network to classify MNIST digits dataset.

Segmentation fault

(tf-gpu) alex@alex-comp:~/PycharmProjects/mnist_test$ python convolutional_network.py
WARNING:tensorflow:From convolutional_network.py:17: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpv36tw2hf
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/estimator/inputs/queues/feeding_queue_runner.py:62: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/estimator/inputs/queues/feeding_functions.py:500: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
2018-11-12 12:12:41.040365: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-11-12 12:12:41.216695: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-12 12:12:41.217246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.725
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.26GiB
2018-11-12 12:12:41.217264: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-11-12 12:12:41.510998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-12 12:12:41.511036: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2018-11-12 12:12:41.511054: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2018-11-12 12:12:41.511288: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6987 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
WARNING:tensorflow:From /home/alex/anaconda3/envs/tf-gpu/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
Segmentation fault (core dumped)

Posted on 2018-11-12 08:17:34
Donald Kinghorn

Hi Serge, glad to hear you are up and running ... that Seg fault is a little disturbing. I'm not sure what could have caused it but I have had random trouble with what appear to be memory leaks in both cpu and gpu space when using Jupyter notebooks. Sometimes they are obvious by looking at memory usage. I generally just shut down the notebook, kill the kernel and start again. I'm used to this sort of thing from long years of dealing with "research code" and new hardware so I don't get too concerned. However, if you are doing "serious" work and running into trouble then look deeper.

One thing to check is the user system resource limits ... let me check on this system ... this is Ubuntu 18.04 in my user account with only the default settings.

The two most important are stack size and shm limit
kinghorn@i9:~$ ulimit -l
16384
kinghorn@i9:~$ ulimit -s
8192

Those are defaults and they are pretty small! In particular too small of a stack size can give segmentation faults for some programs. In my docker containers running on my main machine (it has 128GB memory) I set memlock to "unlimited" and stack size to "67108864"

I have a discussion about this in reference to docker configuration in
https://www.pugetsystems.co...

I may need to write up a short post on this (how to change those setting) Since systemd is controlling things like this now I'd have to look some stuff up and do some testing.

Posted on 2018-11-12 17:24:11
Serge Chugunin

Thank you, Donald
I'll try this.

Posted on 2018-11-13 17:45:46
Kapil Khanna

Hi Donald,

Great post. I am using Tensorflow-GPU and PyTorch-GPU successfully for a couple of weeks now. Thanks.

I have a MSI laptop with Ubuntu 18.04 and configured with Intel Graphics prior to this setup.
The Intel graphics worked better for display. e.g. Hibernate had no issues.
I googled for setting up the display with Intel graphics, while reserving Nvidia for Tensorflow-GPU with CUDA.
Haven't been able to get this working.

DETAILS
*********************
kapilok@trantor-lp98:~$ sudo lshw -C display
[sudo] password for kapilok:
*-display
description: 3D controller
product: GP107M [GeForce GTX 1050 Ti Mobile]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:01:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress bus_master cap_list rom
configuration: driver=nvidia latency=0
resources: irq:147 memory:a3000000-a3ffffff memory:90000000-9fffffff memory:a0000000-a1ffffff ioport:4000(size=128) memory:a4000000-a407ffff
*-display
description: VGA compatible controller
product: Intel Corporation
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 00
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: irq:125 memory:a2000000-a2ffffff memory:80000000-8fffffff ioport:5000(size=64) memory:c0000-dffff
kapilok@trantor-lp98:~$

Please advise.
Thanks,
Kapil

Posted on 2018-12-05 06:11:23
Donald Kinghorn

Laptops can be trouble. I have a gaming laptop with a 1070 in it but it has the display connected directly to it i.e no Intel graphics ... I'm sure it is possible to use your 1050ti while the display is on Intel but your system might fight with you a bit. ???

The first thing to try is to treat it like it was "headless". Just load the nvidia drivers and see if it works (I'm assuming that the /dev node files are there since the hardware is properly detected)

sudo modprobe nvidia
sudo modprobe nvidia_uvm

That should work ??? (I hope) do an ls /dev to see if stuff like /dev/nvidia0 and /dev/nvidia_uvm are there. With the kernel modules loaded, the tools that probe for the card should see it. Try nvidia-smi; if that works then you are in business for compute, since TensorFlow and PyTorch will probe for cuda devices when they start up.

Posted on 2018-12-05 16:30:43
Kapil Khanna

Thanks, Donald.
This is working for now. Laptop under test for a few days. Keep you posted.

Posted on 2018-12-09 00:52:34
Cristiano Coelho Souza

It's important to set the Python version in the command to create the conda env, because tensorflow currently only supports Python 3.6.
--> conda create --name tf-gpu python=3.6

Posted on 2018-12-06 14:43:55
Donald Kinghorn

Thank you for adding that comment! This is a problem with posts like this. Things change! I'm going to try to do a refresh on some of these "core" posts in the new year and I'll try to add warnings in places where versioning can be an issue. ... having simple fine control over that is a nice feature of using conda

Posted on 2018-12-06 17:24:27
Mileta Dulovic

Okay so. I have big problem and I wasn't able to solve it for days.
I've searched like every thread on this and still can't do it.

Firstly. I am using Ubuntu 18.04 with python 3.6 and Nvidia Gforce
920M graphic card. I have Cuda 9.0 (also tried with 9.1 and 9.2).

I was trying to install tensorflow-gpu but every time I import it I get same error.

2018-12-09 01:43:41.324778: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
Aborted (core dumped)

I also tried with Anaconda and still same error. Any tips?

Posted on 2018-12-09 09:34:45
Mileta Dulovic

So.. switching to python 2.7 resolves this problem.. now it works great.. I really have no idea why but it works..

Posted on 2018-12-09 11:17:30
Donald Kinghorn

It's interesting that you were getting a core dump! I'm not sure what's going on there. That message about AVX is normal. TF starts with a hardware capability probe and then warns if the exe wasn't compiled to take advantage of it. I did a custom build with AVX support and it really didn't make much difference.

I'm puzzled about the python 2 vs 3 thing. I would try it again and be sure that you specify which python version during env creation ... on the other hand working with 2.7 is fine too if you are OK with that!

Posted on 2018-12-10 17:18:38
HotNews8

Thank you

Posted on 2018-12-12 18:08:05
jefflee

Hi - Do you have any idea of why this occurred? I installed cuda 10, anaconda (nvidia driver 410), and then tensorflow-gpu. But following the instruction of this article I got this error:

Train on 60000 samples, validate on 10000 samples
2018-12-12 16:03:31.294478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-12 16:03:31.294501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-12 16:03:31.294505: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-12 16:03:31.294508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-12 16:03:31.294574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9508 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Epoch 1/15
Segmentation fault (core dumped)

Maybe the cuda version problem?

Posted on 2018-12-12 21:09:05
jefflee

alright, i figured out it was probably the incompatibility of different versions of cudnn. I created a new environment with conda specifying python=3.6 (conda create --name tf-gpu python=3.6), and then installed tensorflow-gpu=1.8.0 (conda install tensorflow-gpu=1.8.0). I'm still wondering why exactly this happened but at least now all codes on this page run smoothly.

Posted on 2018-12-12 21:48:01
Donald Kinghorn

Good, I'm glad you sorted it out. Stuff like what you saw is not uncommon. Version conflicts can be a headache. Python would be almost unusable without env's! This is a problem writing up posts like this too. Even if everything is working great when I do it initially a few months down the road things may break from version changes. I plan on a refresh of these setup posts and I'm going to try to add warnings about stuff like this

Posted on 2018-12-12 23:28:05