
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1419
Dr Donald Kinghorn (Scientific Computing Advisor)

How to Install TensorFlow with GPU Support on Windows 10 (Without Installing CUDA) UPDATED!

Written on April 26, 2019 by Dr Donald Kinghorn

Introduction

In June of 2018 I wrote a post titled The Best Way to Install TensorFlow with GPU Support on Windows 10 (Without Installing CUDA). That post has served many individuals as a guide for getting a good GPU-accelerated TensorFlow work environment running on Windows 10 without needless installation complexity. It's very satisfying to me personally to have been able to help so many people get started with TensorFlow on Windows 10! However, that guide is nearly a year old now and has needed an update for some time. I've been promising to do this in my comment replies, so here it is.

This post will guide you through a relatively simple setup for a good GPU accelerated work environment with TensorFlow (with Keras and Jupyter notebook) on Windows 10. You will not need to install CUDA for this!

I'll walk you through the best way I have found so far to get a good TensorFlow work environment on Windows 10 including GPU acceleration. This will be a complete working environment including,

  • System preparation and NVIDIA driver update
  • Anaconda Python installation
  • Creating an environment for your TensorFlow configuration using "conda"
  • Installing the latest stable build of TensorFlow in that environment
  • Setting up Jupyter Notebook to work with your new "env"
  • An example deep learning problem using TensorFlow with GPU acceleration, Keras, Jupyter Notebook, and TensorBoard visualization.

Let's do it.


Step 1) System Preparation - NVIDIA Driver Update and checking your PATH variable (Possible "Gotchas")

This is a step that was left out of the original post and the issues presented here were the source of most difficulties that people had with the old post. The current state of your Windows 10 configuration may cause difficulties. I'll try to give guidance on things to look out for.

The primary testing for this post was on a fresh install of Windows 10 Home "October 2018 Update" on older hardware (Intel Core i7 4770 + NVIDIA GTX 980 GPU). This turned out to be a good test system because it would have failed with the old guide without the information in this step.

Check your NVIDIA Driver

This is important and I'll show you why.

Don't assume Microsoft gave you the latest NVIDIA driver! Check it and update if there is a newer version.

Right-click on your desktop and then click "NVIDIA Control Panel"

nvidia control panel 1

nvidia control panel 2

You can see that my fresh install of Windows 10 gave me a version 388 driver. That is way too old! Now click on "System Information" and then the "Components" panel. The next image shows why that 388 driver won't work with the newest TensorFlow,

nvidia control panel 3

The CUDA "runtime" is part of the NVIDIA driver. The CUDA runtime version has to support the version of CUDA you are using for any special software, like TensorFlow, that will be linking to other CUDA libraries (DLLs). As of this writing TensorFlow (v1.13) links against CUDA 10.0. The runtime has to be as new as, or newer than, the extra CUDA libraries you need.
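The rule above is just a version comparison: the driver's CUDA runtime must be at least as new as the CUDA version the framework build needs. A minimal sketch in Python (the version strings here are made-up examples, not values read from any real driver):

```python
# Illustrative only: compare a driver's supported CUDA runtime version
# against the CUDA version a framework build needs (TF 1.13 -> CUDA 10.0).
def cuda_ok(driver_runtime, required):
    """True if the driver's CUDA runtime is as new as, or newer than, 'required'."""
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(driver_runtime) >= to_tuple(required)

print(cuda_ok("9.1", "10.0"))   # False -- a too-old driver, like the 388 case above
print(cuda_ok("10.1", "10.0"))  # True  -- after updating the driver
```

Comparing tuples of integers (rather than raw strings) avoids the trap where "10.0" sorts before "9.1" lexicographically.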

Update the NVIDIA Display Driver

Even if you think you have the latest NVIDIA driver check to be sure.

Go to [https://www.nvidia.com/Download/index.aspx] and enter the information for your GPU. Then click "search".

driver page 1

Click "search" to go to the download page,

driver page 2

It doesn't matter too much which GPU you enter on the search page; the latest driver supports cards all the way back to the 600 series.

Download and install the driver following the prompts.

Note: I used the "Standard" driver. If you are using an install that was done by Dell or HP etc., they may have put their own OEM version on your system. If the standard driver doesn't work, try the "DCH" driver. Also, NVIDIA now has two drivers because some video processing applications were not working right. I used the "Game Ready Driver". After all, it's "Workstation by day, Battle-station by night". Right?

Check your PATH environment variable

This may not be something you think about very often, but it's a good idea to know the state of your PATH environment variable. Why? Development tools will often alter your PATH variable. If you are trying to run some code and getting errors that some library or executable cannot be found, or just having strange problems that don't seem to make sense, then your system may be searching your PATH and finding a version of something that you are not expecting.

If you answer yes to any of the following then you should really look at your PATH,

  • Have you installed Visual Studio?
  • Did you install some version of CUDA?
  • Have you installed Python.org Python?
  • Have you tried a "pip" install of TensorFlow?

You may be reading this because you tried and failed to install TensorFlow following Google's instructions. If you feel that you made a mess on your system then you can try to do some clean-up by uninstalling what you did. But you may not have to clean up. Try the TensorFlow install I suggest below. First, though, look at your PATH so you know its state in case you run into strange errors.

Go to the "Start menu" and start typing "PATH Variable". You should get a search result for the "System Properties" advanced panel in the control panel.

control panel path

Click on "Environment Variables"

control panel sys

The PATH on my testing system is short because I haven't installed anything that would modify it.

If you have a long string then there is a great "Edit..." panel that will show you each entry and allow you to move things up or down, delete entries, or add new ones.

The main idea to keep in mind is that when your system searches for an executable or library, it starts by looking in the current directory (folder), then goes through the directories listed in your User PATH entries, followed by the System PATH. It stops at the first thing that satisfies what you asked for (or fails) ... but that might not be the thing you wanted it to find. It takes the first match. If folders in your PATH contain different versions of an executable or DLL with the same name, move the entry for the version you want toward the beginning of your PATH so it's found first.

Be very careful with your PATH. Don't make changes unless you know what you are doing. Mostly it should just be something you are aware of when troubleshooting.
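That "first match wins" behavior is easy to demonstrate. This little sketch simulates PATH lookup with temporary directories standing in for PATH entries (the DLL name is just an example; no real libraries are involved):

```python
# Simulate first-match PATH resolution: the search stops at the first
# directory containing a file with the requested name.
import os
import tempfile

def find_first(name, search_dirs):
    """Return the full path of the first match for 'name', like a PATH lookup."""
    for d in search_dirs:
        candidate = os.path.join(d, name)
        if os.path.exists(candidate):
            return candidate
    return None  # "command not found"

with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    # Two different "versions" of the same DLL name in two PATH entries
    for d in (old_dir, new_dir):
        open(os.path.join(d, "cudart64_100.dll"), "w").close()
    # Whichever directory is listed first wins -- reordering changes the result
    print(find_first("cudart64_100.dll", [old_dir, new_dir]) ==
          os.path.join(old_dir, "cudart64_100.dll"))  # True
```

This is why moving a PATH entry up in the "Edit..." panel changes which DLL the system loads.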

A special note for laptops

If you have a laptop with an NVIDIA GPU (like a nice gaming laptop) then you should succeed with the instructions in this post. However, one unique problem on laptops is that you will likely have power-saving control that switches your display driver back to the CPU's integrated graphics. A current Windows 10 setup along with the latest driver should automatically switch your display to the NVIDIA driver when you start TensorFlow (same as starting up a game). But if you have trouble that looks like TensorFlow is not finding your GPU, you may need to switch your display manually. You will likely find options by right-clicking on your desktop.


Step 2) Python Environment Setup with Anaconda Python

I highly recommend Anaconda Python. If you need some arguments for using Python take a look at my post Should You Learn to Program with Python. For arguments on why you should use the Anaconda Python distribution see, How to Install Anaconda Python and First Steps for Linux and Windows. Another reason for using Anaconda Python in the context of installing GPU accelerated TensorFlow is that by doing so you will not have to do a CUDA install on your system.

Anaconda is focused toward data-science, machine learning, and scientific computing. It installs cleanly on your system in a single directory, so it doesn't make a mess in your system's application and library directories. It is also performance-optimized for important numerical packages like numpy, scipy, etc.

Download and Install Anaconda Python

Anaconda download

You can choose "Run" to download and start the installer in one step, or download to your machine and double-click on the "exe" file to start the installer.

  • You will be asked to accept a license agreement ...
  • "Select Install Type": I recommend you choose "Just Me" since this is part of your personal development environment.
  • "Choose Install Location": I recommend you keep the default, which is at the top level of your user directory.
  • "Advanced Installation Options"

Advanced install opts

"Register Anaconda as my default Python 3.7" is recommended. "Add Anaconda to my PATH environment variable" is OK to select, but you don't really need it. If you use the GUI (Anaconda Navigator), or the (DOS) shell or PowerShell links in the Anaconda folder on your Start menu, they will temporarily set the proper PATH environment for you without making a "permanent" change to your PATH variable. For this install I will leave it unchecked.

My personal preference is to "Add Anaconda to my PATH" because I want it to be found whenever I use Python.

Note: This version of the Anaconda distribution supports "Python environments" in PowerShell, which is my personal preferred way to work with "conda" on Windows.

Check and Update your Anaconda Python Install

Go to the "Start menu" find the "Anaconda3" item and then click on the "Anaconda Powershell Prompt",

Powershell prompt for Anaconda

With "Anaconda Powershell" opened do a quick check to see that you now have Anaconda3 Python 3.7 as your default Python.

(base) PS>python
Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

Type `exit()` (or CTRL-Z then Enter on Windows) to exit the Python prompt.

Update your base Anaconda packages

`conda` is a powerful package and environment management tool for Anaconda. We'll use `conda` from Powershell to update the base Python install. Run the following commands. It may take some time to do this since there may be a lot of modules to update.

conda update conda

conda update anaconda

conda update python

conda update --all

That should bring your entire base Anaconda install up to the latest packages. (Everything may already be up to date.)

Anaconda Navigator

There is a GUI for Anaconda called `anaconda-navigator`. I personally find it distracting/confusing/annoying and prefer using `conda` from the command-line. Your taste may differ! ... and my opinion is subject to change if they keep improving it. If you are new to Anaconda then I highly recommend that you read up on `conda` even (or especially!) if you are thinking about using the "Navigator" GUI.


Step 3) Create a Python "virtual environment" for TensorFlow using conda

You should set up an environment for TensorFlow separate from your base Anaconda Python environment. This keeps your base clean and gives TensorFlow a space for all of its dependencies. It is good practice in general to keep separate environments for projects, especially when they have special package dependencies. Think of it as a separate "name-space" for your project.

There are many possible options when creating an environment with conda including adding packages with specific version numbers and specific Python base versions. This is sometimes useful if you want fine control and it also helps with version dependency resolution. Here we will keep it simple and just create a named environment, then activate that environment and install the packages we want inside of that.

  • From the "Anaconda Powershell Prompt" command line do,
conda create --name tf-gpu

I named the environment 'tf-gpu' but you can use any name you want. For example you could add the version number.

NOTE: Avoid using spaces in names! Python will not handle that well and you could get strange errors. "-" and "_" are fine. (Python programmers often use underscores.)

  • Now exit from the PowerShell you are using and open a new one before you activate the new "env". This is an annoying quirk, but PowerShell will not re-read its environment until you restart it. If you activate the new "env" before you restart, you will not be able to do any package installs because the needed utilities will not be on the path in the current shell until after a restart.
  • "activate" the environment, (I'll show my full Powershell prompt and output instead of just the commands)
(base) PS C:\Users\don> conda info --envs
# conda environments:
#
base    *  C:\Users\don\Anaconda3
tf-gpu     C:\Users\don\Anaconda3\envs\tf-gpu


(base) PS C:\Users\don> conda activate tf-gpu

(tf-gpu) PS C:\Users\don>

The `conda info --envs` command shows the "envs" you have available.

After doing `conda activate tf-gpu` you can see that the prompt is now preceded by the name of the environment, `(tf-gpu)`. Any conda package installs will now be local to this environment.
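As an aside, the "no spaces" naming advice above is easy to turn into a quick check. This is purely an illustrative sketch (conda enforces its own rules), allowing the letters, digits, dots, dashes, and underscores that work reliably in env names:

```python
# Sanity-check an environment name: stick to letters, digits, '.', '-' and '_'.
import re

def good_env_name(name):
    """True if 'name' uses only characters that are safe for a conda env name."""
    return re.fullmatch(r"[A-Za-z0-9._-]+", name) is not None

print(good_env_name("tf-gpu"))       # True
print(good_env_name("tf gpu"))       # False -- the space will cause grief
print(good_env_name("tf_gpu_1.13"))  # True -- adding a version number is fine
```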


Step 4) Install TensorFlow-GPU from the Anaconda Cloud Repositories

There is an "official" Anaconda maintained TensorFlow-GPU package for Windows 10!

A search for "tensorflow" on the Anaconda Cloud will list the available packages from Anaconda and the community. There is a package "anaconda / tensorflow-gpu 1.13.1" listed near the top that has builds for Linux and Windows. This is what we will be installing from the commands below.

This command will install the latest stable version of TensorFlow with GPU acceleration in this conda environment. (It will be the latest version maintained by the Anaconda team and may lag by a few weeks from any fresh release from Google.)

(tf-gpu) PS C:\Users\don> conda install tensorflow-gpu

That's it! You now have TensorFlow with NVIDIA CUDA GPU support!

This includes TensorFlow, Keras, TensorBoard, the CUDA 10.0 toolkit, and cuDNN 7.3, along with all of their dependencies. It's all in your new "tf-gpu" env, ready to use and isolated from other envs or packages on your system.


Step 5) Simple check to see that TensorFlow is working with your GPU

You can use the PowerShell session in which you activated the tf-gpu env and did the TensorFlow install, or open a new one and do `conda activate tf-gpu`.

With your tf-gpu env active type the following,

python

Your prompt will change to the Python interpreter prompt. This will be a simple test, and we'll use a nice feature of recent TensorFlow releases: eager execution.

>>> import tensorflow as tf

>>> tf.enable_eager_execution()

>>> print( tf.constant('Hello from TensorFlow ' + tf.__version__) )

(that is 2 underscores before and after "version")

My session including the output looked like this,

(base) PS>conda activate tf-gpu

(tf-gpu) PS>python

Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf

>>> tf.enable_eager_execution()

>>> print( tf.constant( 'Hellow from TensorFlow ' + tf.__version__ ) )

2019-04-24 18:08:58.248433: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2

2019-04-24 18:08:58.488035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:

name: GeForce GTX 980 major: 5 minor: 2 memoryClockRate(GHz): 1.2785

pciBusID: 0000:01:00.0

totalMemory: 4.00GiB freeMemory: 3.30GiB

2019-04-24 18:08:58.496081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0

2019-04-24 18:08:58.947914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-04-24 18:08:58.951226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0

2019-04-24 18:08:58.953130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N

2019-04-24 18:08:58.955149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3005 MB memory) -> physical GPU (device: 0, name: GeForce GTX 980, pci bus id: 0000:01:00.0, compute capability: 5.2)

tf.Tensor(b'Hellow from TensorFlow 1.13.1', shape=(), dtype=string)

>>>

When you first run TensorFlow it outputs a bunch of information about the execution environment it is in. You can see that it found the GTX 980 in this system and added it as an execution device.

Next we will do something a little more useful and fun with Keras, after we configure Jupyter notebook to use our 'tf-gpu' environment.


Step 6) Create a Jupyter Notebook Kernel for the TensorFlow Environment

You can work with an editor and the command line and you often want to do that but, Jupyter notebooks are great for doing machine learning development work. In order to get Jupyter notebook to work the way you want with this new TensorFlow environment you will need to add a "kernel" for it.

With your tf-gpu environment activated do,

conda install ipykernel jupyter

Note: I installed both ipykernel and jupyter above since jupyter was not installed by default when we created the tf-gpu env. jupyter is installed by default in the (base) env.

Now create the Jupyter kernel,

python -m ipykernel install --user --name tf-gpu --display-name "TensorFlow-GPU-1.13"

You can set the "display-name" to anything you like. I included the version number here.

With this "tf-gpu" kernel installed, when you start Jupyter notebook you will now have an option to open a new notebook using this kernel.

Start a Jupyter notebook,

jupyter notebook

Look at the "New" menu,

Jupyter kernel for TF

Note: If you start a jupyter notebook from the (base) env you will see the "TensorFlow-GPU-1.13" option, but you will not be able to import tensorflow in that notebook because TensorFlow is only installed in the "tf-gpu" env. [You could have installed it into your (base) env, but I recommend that you keep separate envs.]


Step 7) An Example Convolution Neural Network training using Keras with TensorFlow

In order to check everything out, let's set up the classic neural network LeNet-5 in Keras, using a Jupyter notebook with our "TensorFlow-GPU-1.13" kernel. We'll train the model on the MNIST digits data-set and then use TensorBoard to look at some plots of the job run.

You do not need to install Keras or TensorBoard separately since they are now included with the TensorFlow install.

Activate your "tf-gpu" env

Launch "Anaconda Powershell" and then do,

conda activate tf-gpu

Create a working directory (and log directory for TensorBoard)

I like to have a directory called "projects" in my user home directory. In the projects directory I create directories for things I'm working on. Of course, you can organize your work however you like. ... But I do highly recommend that you learn to use the command-line if you are not familiar with working like that. You can thank me later!

In PowerShell the following commands are useful for managing directories,

To see what directory you are in,

pwd

(if you just opened "Anaconda Powershell" you should be in your "user home directory")

To create a new directory (and additional subdirectories all at once)

Note: when you are working with "code" I highly recommend that you **do not use spaces in directory or file names**.

# In the new version (1.14) you no longer need to create the logs directory for TensorBoard
# It is still good to create a working directory
# mkdir projects/tf-gpu-MNIST/logs
mkdir projects/tf-gpu-MNIST

That one command above gives you a working directory, "tf-gpu-MNIST".

Note: In PowerShell you can use "/" or "\" to separate directories. (PowerShell has many commands that are the same as in Linux, and you can use those as alternatives to the "DOS"-like commands.)

To change directory use "cd"

cd projects/tf-gpu-MNIST

(For completeness) To delete a directory you can use the `rmdir` command.

IMPORTANT!

***********************************************************

The older version (1.13.1) was able to use UNIX-like file paths on Windows, but it looks like version 1.14 does not! You need to change this,

tensor_board = tf.keras.callbacks.TensorBoard('./logs/LeNet-MNIST-1')

to this,

tensor_board = tf.keras.callbacks.TensorBoard('.\logs\LeNet-MNIST-1') 

I also noticed that you no longer need to create the directory beforehand, i.e. if the directory .\logs\LeNet-MNIST-1 doesn't exist when you start the job run, it will be created automatically.

*************************************************************
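A portable way to sidestep the separator issue is to let Python build the path for you: `os.path.join` uses the right separator for whatever platform the notebook is running on. A small sketch (the commented-out callback line shows where such a path would be used):

```python
# Build the TensorBoard log path with os.path.join so the separator is
# correct on Windows ('\') and on Linux/macOS ('/') without hand-editing.
import os

log_dir = os.path.join(".", "logs", "LeNet-MNIST-1")
print(log_dir)  # './logs/LeNet-MNIST-1' on Linux/macOS, '.\logs\LeNet-MNIST-1' on Windows
# tensor_board = tf.keras.callbacks.TensorBoard(log_dir)  # then pass it to the callback
```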

Launch a Jupyter Notebook

After "cd'ing" into your working directory and with the tf-gpu environment activated, start a Jupyter notebook,

jupyter notebook

From the 'New' drop-down menu select the 'TensorFlow-GPU-1.13' kernel that you added (as seen in the image in the last section). You can now start writing code!


MNIST hand written digits example

The following "code blocks" can be treated as Jupyter notebook "cells". You can type them in (recommended for practice) or cut and paste. To execute the code in a cell use `Shift-Return`.

We will setup and train LeNet-5 with the MNIST handwritten digits data.

Import TensorFlow

import tensorflow as tf

Load and process the MNIST data

mnist = tf.keras.datasets.mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
# reshape and rescale data for the CNN

train_images = train_images.reshape(60000, 28, 28, 1)

test_images = test_images.reshape(10000, 28, 28, 1)

train_images, test_images = train_images/255, test_images/255
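The divide-by-255 step maps the 8-bit pixel intensities into the range [0, 1], which helps training behave well. What that rescale does, shown in plain Python (no numpy or TensorFlow needed):

```python
# Rescaling: 8-bit pixel values (0..255) become floats in [0.0, 1.0]
pixels = [0, 64, 128, 255]
scaled = [p / 255 for p in pixels]
print(scaled)  # [0.0, 0.2509..., 0.5019..., 1.0]
```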

Create the LeNet-5 convolution neural network architecture

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])

Compile the model

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Set log data to feed to TensorBoard for visual analysis

tensor_board = tf.keras.callbacks.TensorBoard('./logs/LeNet-MNIST-1')

Train the model (with timing)

import time

start_time=time.time()

model.fit(train_images, train_labels, batch_size=128, epochs=15, verbose=1,
         validation_data=(test_images, test_labels), callbacks=[tensor_board])

print('Training took {} seconds'.format(time.time()-start_time))

The results

After running that training for 15 epochs the last epoch gave,

Train on 60000 samples, validate on 10000 samples
Epoch 1/15
60000/60000 [==============================] - 6s 105us/sample - loss: 0.2400 - acc: 0.9276 - val_loss: 0.0515 - val_acc: 0.9820
...
...
Epoch 15/15
60000/60000 [==============================] - 5s 84us/sample - loss: 0.0184 - acc: 0.9937 - val_loss: 0.0288 - val_acc: 0.9913
Training took 79.47694969177246 seconds

Not bad! Training accuracy 99.37% and Validation accuracy 99.13%

It took about 80 seconds on my old Intel i7-4770 box with an NVIDIA GTX 980 GPU (it's about 17 times slower on the CPU).


Look at the job run with TensorBoard

Open another "Anaconda Powershell" and activate your tf-gpu env, and "cd" to your working directory,

conda activate tf-gpu

cd projects/tf-gpu-MNIST

Then startup TensorBoard

tensorboard --logdir=./logs --port 6006

It will give you a local web address with the name of your computer (like the lovely name I got from this test Win10 install)

tensorboard start

Open that address in your browser and you will be greeted with (the wonderful) TensorBoard. These are the plots it had for that job run,

Note: on Chrome I had to use localhost:6006 instead of the address returned by TensorBoard.

TensorBoard output

Note: For a long training job you can run TensorBoard on a log file during the training. It will monitor the log file and let you refresh the plots as training progresses.


Conclusion

That MNIST digits training example was a model with 1.2 million training parameters and a dataset with 60,000 images. **It took 80 seconds utilizing the NVIDIA GTX 980 on my old test system! For reference, it took 1345 seconds using all cores at 100% on the Intel i7-4770 CPU in that machine. That's a 17-fold speedup on the GPU. That's why you use GPUs for this stuff!**
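Both of those headline numbers are easy to verify by hand. The parameter count follows from the layer shapes in the Step 7 model (the 28x28 input shrinks to 26x26 and then 24x24 through the two 3x3 convs, and the 2x2 max-pool gives 12x12), and the speedup is just the ratio of the run times. Pure Python, no TensorFlow needed:

```python
# Parameter count of the Step 7 network, computed from the layer shapes.
conv1 = (3*3*1 + 1) * 32       # 3x3 kernels over 1 input channel, 32 filters (+bias each)
conv2 = (3*3*32 + 1) * 64      # 3x3 kernels over 32 channels, 64 filters
# 28x28 -> 26x26 -> 24x24 after the convs, then 2x2 max-pool -> 12x12
flat = 12 * 12 * 64            # flattened feature map fed to the dense layers
dense1 = (flat + 1) * 128      # Dense(128) weights + biases
dense2 = (128 + 1) * 10        # Dense(10) output layer
total = conv1 + conv2 + dense1 + dense2
print(total)                   # 1199882 -- the "1.2 million" parameters
print(round(1345 / 80, 1))     # 16.8 -- the roughly 17-fold GPU speedup
```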

Note: I used the same procedure for doing the CPU version. I created a new "env" naming it "tf-CPU" and installed the CPU only version of TensorFlow i.e. `conda install tensorflow` without the "-gpu" part. I then ran the same Jupyter notebook using a "kernel" created for that env.

I sincerely hope this guide helps get you up-and-running with TensorFlow. Feel free to add comments if you have any trouble. Either myself or someone else in the community will likely be able to help you!

Happy computing! --dbk



Tags: Windows, Machine Learning, Tensorflow, NVIDIA, GPU
Bernardo Rufino

Thank you for such a detailed article!

It helped me a lot, not only for installing tensorflow with GPU support, but also to see how great it is to use the shell when you know the commands!

Thanks!

Posted on 2019-05-03 02:48:21
Donald Kinghorn

You are welcome. I really like where things are headed with Powershell (and WSL) on Windows 10. I was happy to see that Anaconda has full support for PS now. The command-line will give you great powers :-)

It's even getting to where you can go back and forth between Windows and Linux pretty seamlessly. I have SSH, both client and server, running on Win 10. That gives some very interesting usage possibilities in a heterogeneous environment.

I'll be doing more posts about this kind of thing.

Posted on 2019-05-03 15:17:05
dt

There is a missing reference to the TensorBoard callback in the MNIST example. I modified my notebook in this part to look like this

import time

tbCallBack = tf.keras.callbacks.TensorBoard(log_dir='./Graph', histogram_freq=0,

write_graph=True, write_images=True)

start_time=time.time()

model.fit(train_images, train_labels, batch_size=128, epochs=15, verbose=1,

validation_data=(test_images, test_labels), callbacks=[tbCallBack])

print('Training took {} seconds'.format(time.time()-start_time))

Posted on 2019-05-03 05:43:40
Donald Kinghorn

I think you just missed it :-) I have this in there between the compile and fit

tensor_board = tf.keras.callbacks.TensorBoard('./logs/LeNet-MNIST-1')

but I only set the log directory and left everything else as defaults

There are a whole bunch of options when you instantiate a TensorBoard callback. The args you passed made me go look at the new docs (which I hadn't done yet :-)
https://www.tensorflow.org/...

Thanks --Don

Posted on 2019-05-03 14:23:31
Adam Kowalczewski

This is a fantastic alternative to any other instructions I've seen. Got me up and running with a gpu in the most direct way. Thank you so much.

Posted on 2019-05-11 21:41:56
Werner van Waesberghe

Many thanks. First test runs approx. 15 times faster.

Posted on 2019-05-11 23:53:42
Franz Hahn

Unfortunately I cannot get Tensorboard to work. The process runs, the python.exe in my tf-gpu env is allowed through my firewall, but when I open the tensorboard url I get an ERR_CONNECTION_REFUSED. Any ideas?

Posted on 2019-05-17 14:12:18
Florian Bautry

Try to go directly to localhost:6006 in your navigator.

Posted on 2019-05-24 03:54:34
Franz Hahn

I did, and I also ran the tensor board server on other ips to see if it was an IP conflict. Not the case, doesn't work on any IP.

Posted on 2019-05-24 06:33:10
Pranabesh Das

This one (http://localhost:6006/#scalars) worked for me. Thanks!

Posted on 2019-06-08 05:22:17
Donald Kinghorn

Tensorboard sometimes gives people trouble. I'm not completely sure why.

The two things that I believe cause the most trouble are,
1) the directory path to the log file that the "callback" needs to write the data to is not correct somewhere in your code and
2) Tensorboard permissions to read that data.

(I've messed up with tensorboard before by using "logs" in one place and "log" in another and not catching it)

I had some strange trouble myself last weekend (not tensorboard related) on a new Win10 install that had OneDrive enabled. I was creating python directories and files with Powershell and they disappeared within file explorer!! (and didn't show up in OneDrive either) I suspect that sometimes OneDrive can do unexpected things on your system and could cause difficulty for programs like Tensorboard ??? It's possible that your "connection refused" error is because the log file isn't where you think it is and access to it is restricted because of OneDrive (applications have to be registered to access those files (?)) [ I lost some of my work! I got so frustrated with it that I reinstalled the system and left OneDrive disabled ]

My advice is to double-check that your "logs" directory is where you think it is and then make sure that OneDrive has nothing to do with the directories you are using.
Another thing that might help is to create a directory in your user directory ( I use "projects") and do all of your work in there (sub-directories). ... and if in doubt give full paths to file/dir names i.e. C:\Users\don\projects\TFtest1\logs

Posted on 2019-05-24 16:11:22
Dinesh Muniandy

Hi Donald,

Just a question, since this article doesn't focus on installing CUDA (or it's libraries) to use Tensorflow GPU with; are we going to be experiencing any form of lost in performance - meaning, the speed/efficiency that we usually get while training models (with CUDA) ?

Update 1:
* mkdir projects/tf-gpu-MNIST/logs (didn't work for me)
* mkdir projects\tf-gpu-MNIST\logs (changing this to backslash, worked for me)

Posted on 2019-05-18 11:28:04
Donald Kinghorn

Hi Dinesh, Yes! the mkdir command is the same in CMD and in Powershell but in Powershell you can use either directory separator.

Also, the cuda toolkit cuBLAS and cuDNN that get installed with the conda command are the same packages (libraries DLL's) that would be installed from a direct system wide install. But they are localized and the correct versions for the particular build of TF

Posted on 2019-05-24 15:33:25
S A

This was a great article. It is tough to find guides this detailed that review all the potential pitfalls as well.

Out of curiosity, could you deploy your Jupyter notebook onto Google Colab? Especially if you wanted to make your model available for inference by others? Have you had any experience using NiftyNet or similar packages within this environment?

Thanks!

Posted on 2019-05-26 01:22:44
Donald Kinghorn

Thanks! covering the pitfalls was part of my motivation for doing this post. The one last year left out just enough that a lot of people ran into trouble.

Colab is really nice and I wouldn't be surprised to see this MNIST with LeNet example already up there as an example.

NiftyNet looks really interesting, I hadn't seen that before.

Something I looked at today that got me motivated was the recently released TensorFlow-graphics https://github.com/tensorfl... It's unsupervised learning with training simultaneously on an encoder and decoder for images ... based on "differentiable graphics" ... There are some tutorials on Colab that are linked from the github page. I think I want to dig into this one a bit :-)

Posted on 2019-05-29 01:56:54
Prachi Sharma

This is the best post ! Everything is clearly written here. :)

Posted on 2019-06-04 13:52:00
Donald Kinghorn

Thank you :-)

Posted on 2019-06-05 16:30:50
Pranabesh Das

Great post! Thanks a ton.

Posted on 2019-06-08 05:23:15
BobVan

This is terrific. I spent the past two days trying to set up a new laptop with an RTX 2060 card installed. Your instructions worked perfectly.

THANK YOU!!!!!!!!

Posted on 2019-06-10 02:13:15
Prachi Sharma

Hi! I did it the same way many days back. Everything was working fine whenever I was doing only transfer learning, but when I started training CNNs, this error is shown in the notebook:

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node block1_conv1/convolution}}]]

I am not able to solve it. If you have any suggestions to solve it, then do let me know.

Specs: Windows 10, tensorflow-gpu = 10.0.0, CUDA = 10.0.0, Cudnn = 7.7.0, Builtin GPU = GeForce GTX 1060 3GB, Compute capability = 6.1

Thanks in Advance !

Posted on 2019-06-13 14:19:30
Donald Kinghorn

Hi Prachi, I just checked on my laptop (similar, with a 1060). This system had updated to Win10 19.03 since I last used Anaconda on it. It was badly broken! I have reinstalled the latest Anaconda version as described in this post. Just now checking some things...

The anaconda developers have been making a lot of changes recently and there have been problems reported. I just now had errors when trying to use Powershell. These went away when I did an update i.e.
conda update conda
conda update --all

I re-did the tf-gpu install and ran the MNIST notebook everything looks OK.

I noticed one thing strange in what you reported Cudnn = 7.7.0 In the install I just did it added these, (conda list shows this)
cudatoolkit 10.0.130 0
cudnn 7.6.0 cuda10.0_0
tensorflow-gpu 1.13.1 h0d30ee6_0

You may have gotten a cudnn 7.7.0 somehow ??? (that would be a version conflict) I suggest that you create a new env (maybe tf-gpu-new) and then do a fresh tensorflow-gpu install in that.

I'm hopeful this will fix things for you. Post back with more info if it doesn't. --Don

Posted on 2019-06-15 01:54:15
Suhaas Valanjoo

I installed Anaconda for Python 3.7 on Windows 10. If I used "activate" or "conda activate" in the PowerShell prompt, it just would not work, but it works very well in the Windows command line. Basically, the asterisk indicator never moved in front of the newly activated environment, but I clearly see it with the Windows prompt. There is an open issue on GitHub. https://github.com/conda/co...


Pictures attached https://uploads.disquscdn.c... https://uploads.disquscdn.c... https://uploads.disquscdn.c...

Posted on 2019-06-15 02:29:29
Donald Kinghorn

They made some changes recently that have caused some trouble. I had a similar problem and I think I resolved it by doing "conda init" in PowerShell ...

It has been a bit of a mess. Try doing a conda init and then conda activate tf-gpu

Posted on 2019-06-18 21:24:21
Suhaas Valanjoo

Thanks, Donald. Big mess, true :-). I spent hours trying to figure out the workaround.

Posted on 2019-06-18 22:41:51
Suhaas Valanjoo

'conda init' did not solve the issue (picture attached)

Posted on 2019-06-20 01:52:45
Donald Kinghorn

Dang! I know I had the same issue but don't remember what fixed it. Try this (if you haven't already):
open PS, then do cmd to switch to the CMD shell, then
conda update conda
conda update --all
exit
exit
Then try again.
I was frustrated with Anaconda for a couple of weeks because of some of their changes. I messed with it a few times, including re-installs of older releases and then finally the latest update. Everything is working the way I want now, but the PS issues were the most annoying!

Posted on 2019-06-20 16:26:17
Padmakumar Nambiar

Hello Donald, Great article... Thanks for sharing!

Now, I'm facing the following issue related to performance of my ML code and thought you may be able to help please...

This particular call in "mrcnn.model.py" takes about 13 secs to return !!

r = model.detect([image], verbose=0)[0]

… which finally ends in the following call in the "sessions.py" file,...

ret = tf_session.TF_SessionRunCallable(self._session._session, self._handle, args, status, run_metadata_ptr)

… which in turn calls the following function call in "pywrap_tensorflow_internal.py" (which consumes most of the time delay mentioned above):

def TF_SessionRunCallable(session, handle, feed_values, out_status, run_metadata):
    return _pywrap_tensorflow_internal.TF_SessionRunCallable(session, handle, feed_values, out_status, run_metadata)

TF_SessionRunCallable = _pywrap_tensorflow_internal.TF_SessionRunCallable

My TF version is: 1.13.1

Any help will be greatly appreciated. Thanks.

Posted on 2019-06-18 09:48:17
Donald Kinghorn

I don't have any good ideas on that. It does look like it is something with TensorFlow itself, since the time is going into the Python wrapper call. There is a lot going on in the model with Mask R-CNN and it could be using ResNet-50 or -101 ... I've never used this so I don't know what would be good/bad performance timing.

I wish I could offer you better advice but I'm afraid I don't have any ideas.

Posted on 2019-06-18 21:17:55
Padmakumar Nambiar

Hi Donald, Thanks for your reply... Just in case it helps, I uninstalled my existing TF version, and installed tf-gpu-1.3.1 instead, and now I find a bunch of errors around the pywrap function mentioned above (pasted below). I changed it back to 1.13.1, those errors disappeared, but it started taking 13 secs for each frame in the video.

<<anaconda prompt="">>python video_demo.py
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 58, in <module> from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 28, in <module> _pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\user\Anaconda3\lib\imp.py", line 242, in load_module return load_dynamic(name, filename, file)
File "C:\Users\user\Anaconda3\lib\imp.py", line 342, in load_dynamic return _load(spec)
ImportError: DLL load failed: The specified module could not be found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "video_demo.py", line 2, in <module> from visualize_cv2 import model, display_instances, class_names
File "C:\Users\user\eclipse-workspace\New_mask_RCNN\visualize_cv2.py", line 6, in <module> from mrcnn import utils
File "C:\Users\user\eclipse-workspace\New_mask_RCNN\mrcnn\utils.py", line 15, in <module> import tensorflow as tf
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\__init__.py", line 24, in <module> from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\__init__.py", line 49, in <module> from tensorflow.python import pywrap_tensorflow
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 74, in <module> raise ImportError(msg)

ImportError: Traceback (most recent call last):
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 58, in <module> from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 28, in <module> _pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\user\Anaconda3\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\user\Anaconda3\lib\imp.py", line 242, in load_module return load_dynamic(name, filename, file)
File "C:\Users\user\Anaconda3\lib\imp.py", line 342, in load_dynamic return _load(spec)
ImportError: DLL load failed: The specified module could not be found.

Posted on 2019-06-19 03:52:37
nandodmelo

I will try your approach, but I have a trivial question, what’s the difference between GPU acceleration without installing CUDA and Installing CUDA/cuDNN? Or is it the same thing?

Posted on 2019-06-24 15:31:04
Donald Kinghorn

That is a very good question and it's important to understand!

I say "without installing CUDA" but really all of the libraries are installed along with tensorflow-gpu when you use the Anaconda package. They are kept in sub-directories alongside the TensorFlow-GPU package install. They are isolated from the rest of your setup by creating the "env" i.e. tf-gpu. That does 3 things for you:

1) It keeps you from having to do a full manual install of CUDA (which requires that you also install MS Visual Studio)

2) It gives you the correct versions of the CUDA libraries that you need! That point is really important, because if you just did a CUDA install you would likely be getting the latest version from NVIDIA, which right now is 10.1. Google is compiling TensorFlow 1.13, 1.14 and 2.0.0-beta1 with CUDA 10.0, i.e. CUDA 10.1 wouldn't work!

3) Lastly, it gives you an easy way to have multiple packages installed that use different versions of the CUDA and cuDNN libraries. For example you could do the same kind of install for PyTorch linked against CUDA 10.1 or 9.2 or whatever.

It also makes upgrade paths a lot cleaner too, just make a new env and install a new version. Bottom line is it helps you keep from having a mess on your system.
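To convince yourself that the CUDA libraries really do live inside the env, you can look for the DLLs under the env's Library\bin directory. A hypothetical little helper, just a sketch (the filtering logic is the point; the env path will differ on your machine):

```python
import os
import sys

def find_cuda_dlls(filenames):
    """Pick out CUDA-related DLLs (cudart, cublas, cudnn, ...) from a file listing."""
    prefixes = ("cudart", "cublas", "cudnn", "cufft", "curand", "cusolver")
    return sorted(n for n in filenames if n.lower().startswith(prefixes))

# In an activated conda tf-gpu env the DLLs live under <env>\Library\bin
dll_dir = os.path.join(sys.prefix, "Library", "bin")
if os.path.isdir(dll_dir):
    print(find_cuda_dlls(os.listdir(dll_dir)))
```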

P.S. I'm working on a short post right now that will use the tensorflow-gpu setup in this post as the basis for setting up a TensorFlow 2.0.0-beta1 install for testing (we can do that because the TF 2 beta has the same CUDA dependencies as the other versions :-) --Don

Posted on 2019-06-25 17:17:29

Thank you. It works. I got about a 20 times speedup using a GTX 1060 compared to my CPU, an i7-3770 @ 3.4GHz (3.9GHz boost).
The GPU utilization was around 40%. A very crucial detail is the neural network structure.
If your network is too small you won't gain any speedup. I tried the simple MNIST model example from the TensorFlow tutorial and gained nothing.
That mistake made me think I had installed the GPU version incorrectly, until I tried the LeNet-5 model and saw the 20 times speedup.

Posted on 2019-06-24 20:28:46
Donald Kinghorn

Good point! I didn't really discuss that in the post. Small test jobs and a lot of "learning" examples might not show much or any improvement in performance on GPU. It's when the model and/or the data-set gets larger that the GPU really starts to be a BIG advantage.

Posted on 2019-06-25 16:59:19
Sri Harsha

Hey,

I see the TensorFlow page lists Python 3.6 as the requirement; I don't understand how you were able to install it on Python 3.7.
Can you please explain that ?

Thanks

Posted on 2019-06-28 15:04:42
Donald Kinghorn

It's not strictly a requirement, although it's probably good not to use a Python version older than that. The dependencies are determined when the code is compiled. The devs working on the build for Anaconda probably used the latest Python in their environment when they compiled TensorFlow-GPU for their package.

"conda" is a package manager. When you do "conda install tensorflow-gpu" it is going to pull the package from the official build on Anaconda cloud. conda will have a list of the dependencies for the package and make sure that they are met when it does the install (it will warn you if it needs to downgrade any existing package in the env). You don't have to worry about it.

However, you probably could use Python 3.6 if you really wanted to. You can set a Python base version when you create an env. For example you could try something like

conda create --name tf-gpu-py36 python=3.6
conda activate tf-gpu-py36
conda install tensorflow-gpu

let me try that ... OK... in that env "conda list" shows (leaving out most of the output)
...
cudatoolkit 10.0.130 0
cudnn 7.6.0 cuda10.0_0
...
python 3.6.8 h9f7ef89_7
...
tensorboard 1.13.1 py36h33f27b4_0
tensorflow 1.13.1 gpu_py36h9006a92_0
tensorflow-base 1.13.1 gpu_py36h871c8ca_0
tensorflow-estimator 1.13.0 py_0
tensorflow-gpu 1.13.1 h0d30ee6_0

I tested it and it works fine :-) Personally I would keep the latest Python 3.7 unless you have some code that has a real conflict with it that you want to use in the same env. You have a LOT of control over versioning with conda!

Posted on 2019-06-28 16:24:03
Mr 53,461

Hi Donald,
I want to add one of the tensorflow/examples/ that I found on GitHub.
Where is the 'right' place to put it?

(base) PS C:\Users\jeff> conda info --envs
base * C:\jeff\anaconda3
tf-gpu C:\jeff\anaconda3\envs\tf-gpu

Thanks!

Posted on 2019-07-02 18:17:19
Donald Kinghorn

You can put source anywhere you like. I usually create a directory under my user directory called projects (I do this on Linux and Windows) and then I create directories in there for anything that I'm working on.

If you open Powershell you can do
pwd (to see what dir you are in; should be C:\Users\jeff)
mkdir projects
cd projects
mkdir tf-examples
cd tf-examples

And then put your stuff from GitHub in there (of course you can use any dir name you like, and you can use the GUI file manager to do this too)

One thing to note, I advise you to *not* use spaces in any directory or file names. Windows allows that but it can cause problems with Python.
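If you want to guard against that in a script, here is a tiny hypothetical helper (not part of any library, just a sketch of the check):

```python
def check_path_for_spaces(path):
    """Reject paths containing spaces. Windows allows them, but they trip up
    many command lines and Python tools, so it's safer to avoid them."""
    if " " in path:
        raise ValueError(f"Path contains spaces, consider renaming: {path!r}")
    return path

check_path_for_spaces(r"C:\Users\jeff\projects\tf-examples")  # fine, no spaces
```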

Posted on 2019-07-02 23:13:43
Mr 53,461

Works!
For anyone else's benefit...
In Anaconda PowerShell I say 'conda activate tf-gpu' and the prompt looks like this: (tf-gpu) PS C:\Users\jeff\projects\tf-examples.
Then I did a big github clone: git clone https://github.com/tensorfl...

Now when I start a jupyter notebook I can import models

Thanks a lot Donald!!

Posted on 2019-07-03 14:54:30
Lina Chato

Hi,
Thank you for the great article....
I have a Quadro P2000 GPU and this is not good for DL/ML applications, so I bought a TITAN. I would be thankful if you could advise! Should I reinstall TensorFlow on my system after mounting the new GPU and installing its driver?

Posted on 2019-07-05 16:42:37
Donald Kinghorn

I don't think you will need to reinstall TensorFlow if your setup has been working well for you.

You should be able to install the Titan and then restart. The system should detect the card and then update the driver. Be sure to check the driver version as I suggested in this post. It would be good to update it to the latest release (430 right now, I believe).

Posted on 2019-07-05 18:20:09
Thomas Chu

Thanks a lot!
It works for my system (HP Z4 G4 + Titan).

Posted on 2019-07-08 19:52:56
pl709

Sir, your post is like a shining light in a stormy sky! I've been battling with system/CUDA/cuDNN/py/tf-gpu compatibility problems for a looooong time

Posted on 2019-07-26 02:33:05
Donald Kinghorn

Ha ha, I'm glad you were able to see the light! I do appreciate your (and everyone's) kind words :-) --Don

Posted on 2019-07-26 15:27:24
Divyansh Jain

Excellent Article!! Thank you so much :)

Posted on 2019-08-03 12:46:23
Brandon Elford

Amazing, you have ended two weeks worth of time trying to install TF-GPU. Thank you so much!

Posted on 2019-09-09 15:35:16
Andrew Samsock

Great install guide! Thank you. I'm getting the following error when trying to run the MNIST code sample.

2019-09-23 14:46:59.020150: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'cupti64_100.dll'; dlerror: cupti64_100.dll not found
2019-09-23 14:46:59.024270: W tensorflow/core/profiler/lib/profiler_session.cc:182] Encountered error while starting profiler: Unavailable: CUPTI error: CUPTI could not be loaded or symbol could not be found.
2019-09-23 14:46:59.035553: I tensorflow/core/platform/default/device_tracer.cc:641] Collecting 0 kernel records, 0 memcpy records.
2019-09-23 14:46:59.038756: E tensorflow/core/platform/default/device_tracer.cc:68] CUPTI error: CUPTI could not be loaded or symbol could not be found.

Posted on 2019-09-23 21:49:08
Donald Kinghorn

Hmmm, I'm not sure about that one. Did you do a pip install as outlined on the TensorFlow docs page? If so you may be having some library conflicts. If not, and you have just followed what is in this guide, then I'm not sure ... let me try something ...

OK, I tried my existing install on this Win10 laptop and it worked fine (it was using Python 3.7.3 and tensorflow 1.13.1). Then in that env I did conda update --all which moved python to 3.7.4 and tensorflow to 1.14. BROKEN! I'm getting "ModuleNotFoundError: No module named 'win32api'" when I start up a jupyter notebook

... I've been messing around for a couple of hours trying to find the problem ... I tried creating an env with python 3.7.3 and tensorflow 1.13.1 but hit the same problem. I can run everything fine from the console but jupyter and jupyterlab are broken (I'm not sure if it's jupyter or something that jupyter depends on ??)

Looks like something is messed up with some recent update in Anaconda. I was not able to reproduce the error you saw but got some new ones of my own.

This may take a while to sort out YUK! ... sorry ... I'll hack on it some more but right now my newer envs are all broken for Jupyter on Windows

Posted on 2019-09-24 00:57:09
Donald Kinghorn

If you are still running into problems ... TensorFlow 2rc seems to be working OK ... see above

Posted on 2019-09-26 16:27:33
Donald Kinghorn

Just to let everyone know ... Today (Sept. 24, 2019) I did a fresh install of Anaconda Python 2019.07 and it is not working correctly! My jupyter notebooks launch but fail to load a kernel, with a missing win32api error. (command line runs seem to be OK) This is probably a broken Anaconda distribution, since the 2019.03 install I had on the system before this reinstall was working fine.

I did all of the updates like I mention in this post. Things might work OK with this 2019.07 install if you don't do the updates ???

For anyone having trouble: here is a link to the Anaconda installer archives where you can find the 2019.03 package. I'm going to revert to that and see if things work again ... Crud! That didn't work either ... this may just be broken for a while until things get fixed upstream --Don

Posted on 2019-09-25 03:49:14
김학배

Thank you very much.
Until now, I could not get the tensorflow-gpu version installed on my PC.

thank you.

Posted on 2019-09-26 04:33:42
Donald Kinghorn

TensorFlow 2rc seems to be working OK ... see above

Posted on 2019-09-26 16:26:31
Donald Kinghorn

For everyone that has been having trouble with TensorFlow 1.14 working correctly in a Jupyter notebook (see my comment from Sept 24) I have good news...
TensorFlow 2rc seems to be working fine!
If you want to start using TensorFlow 2 (and you should) then you can install that using pip over the top of your install that was outlined above.

I wrote a post about doing this in June when beta1 had come out...
"Install TensorFlow 2 beta1 (GPU) on Windows 10 and Linux with Anaconda Python (no CUDA install needed)"
https://www.pugetsystems.co...

In that post the TensorFlow 2 version was beta1. It is now a release candidate and the version number is 2.0.0-rc1, so you would want to use the following command

pip install tensorflow-gpu==2.0.0-rc1

Please read the post mentioned above. It will have you clone your existing env to a new name and then do the pip TF2 install.

Posted on 2019-09-26 16:26:11
Pranab Das

First, thanks for the excellent guide :-)

Getting the following exception at the last stage of MNIST training:
NotFoundError Traceback (most recent call last)
<ipython-input-8-792ff921728a> in <module>
4
5 model.fit(train_images, train_labels, batch_size=128, epochs=15, verbose=1,
----> 6 validation_data=(test_images, test_labels), callbacks=[tensor_board])
7
8 print('Training took {} seconds'.format(time.time()-start_time))

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
778 validation_steps=validation_steps,
779 validation_freq=validation_freq,
--> 780 steps_name='steps_per_epoch')
781
782 def evaluate(self,

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
372 # Callbacks batch end.
373 batch_logs = cbks.make_logs(model, batch_logs, batch_outs, mode)
--> 374 callbacks._call_batch_hook(mode, 'end', batch_index, batch_logs)
375 progbar.on_batch_end(batch_index, batch_logs)
376

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
246 for callback in self.callbacks:
247 batch_hook = getattr(callback, hook_name)
--> 248 batch_hook(batch, logs)
249 self._delta_ts[hook_name].append(time.time() - t_before_callbacks)
250

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in on_train_batch_end(self, batch, logs)
529 """
530 # For backwards compatibility.
--> 531 self.on_batch_end(batch, logs=logs)
532
533 def on_test_batch_begin(self, batch, logs=None):

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks_v1.py in on_batch_end(self, batch, logs)
360 self._total_batches_seen += 1
361 if self._is_profiling:
--> 362 profiler.save(self.log_dir, profiler.stop())
363 self._is_profiling = False
364 elif (not self._is_profiling and

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\profiler.py in save(logdir, result)
142 logdir, 'plugins', 'profile',
143 datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
--> 144 gfile.MakeDirs(plugin_dir)
145 maybe_create_event_file(logdir)
146 with gfile.Open(os.path.join(plugin_dir, 'local.trace'), 'wb') as f:

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir(dirname)
436 errors.OpError: If the operation fails.
437 """
--> 438 recursive_create_dir_v2(dirname)
439
440

~\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir_v2(path)
451 errors.OpError: If the operation fails.
452 """
--> 453 pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path))
454
455

NotFoundError: Failed to create a directory: ./logs/LeNet-MNIST-1\plugins\profile\2019-10-10_15-12-16; No such file or d

Posted on 2019-10-10 09:48:13
Pranab Das

Somehow my yesterday's post got deleted; hence reposting. While training the model with timing in the MNIST example, I get the following error. Any suggestion?
Train on 60000 samples, validate on 10000 samples
Epoch 1/15
128/60000 [..............................] - ETA: 13:23 - loss: 2.3159 - acc: 0.0625

---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-7-792ff921728a> in <module>
4
5 model.fit(train_images, train_labels, batch_size=128, epochs=15, verbose=1,
----> 6 validation_data=(test_images, test_labels), callbacks=[tensor_board])
7
8 print('Training took {} seconds'.format(time.time()-start_time))

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
778 validation_steps=validation_steps,
779 validation_freq=validation_freq,
--> 780 steps_name='steps_per_epoch')
781
782 def evaluate(self,

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
372 # Callbacks batch end.
373 batch_logs = cbks.make_logs(model, batch_logs, batch_outs, mode)
--> 374 callbacks._call_batch_hook(mode, 'end', batch_index, batch_logs)
375 progbar.on_batch_end(batch_index, batch_logs)
376

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
246 for callback in self.callbacks:
247 batch_hook = getattr(callback, hook_name)
--> 248 batch_hook(batch, logs)
249 self._delta_ts[hook_name].append(time.time() - t_before_callbacks)
250

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in on_train_batch_end(self, batch, logs)
529 """
530 # For backwards compatibility.
--> 531 self.on_batch_end(batch, logs=logs)
532
533 def on_test_batch_begin(self, batch, logs=None):

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks_v1.py in on_batch_end(self, batch, logs)
360 self._total_batches_seen += 1
361 if self._is_profiling:
--> 362 profiler.save(self.log_dir, profiler.stop())
363 self._is_profiling = False
364 elif (not self._is_profiling and

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\profiler.py in save(logdir, result)
142 logdir, 'plugins', 'profile',
143 datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
--> 144 gfile.MakeDirs(plugin_dir)
145 maybe_create_event_file(logdir)
146 with gfile.Open(os.path.join(plugin_dir, 'local.trace'), 'wb') as f:

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir(dirname)
436 errors.OpError: If the operation fails.
437 """
--> 438 recursive_create_dir_v2(dirname)
439
440

~\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir_v2(path)
451 errors.OpError: If the operation fails.
452 """
--> 453 pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path))
454
455

NotFoundError: Failed to create a directory: ./logs/LeNet-MNIST-1\plugins\profile\2019-10-11_10-51-33; No such file or directory

Posted on 2019-10-11 05:26:27
Donald Kinghorn

Yes, it looks like it was getting blocked because of the links or something ...
The trouble you are having is most likely from not having the "logs" directory created before the call to TensorBoard. This happens often (to me too); there are some other comments with the same problem.

Check that your directories exist and that they are visible from the directory that you are starting your jupyter notebook in --Don
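A simple way to avoid this is to create the full log path yourself before handing it to the TensorBoard callback. A sketch (the "LeNet-MNIST-1" run name is just the one from the post's example):

```python
import os

# Use a per-run subdirectory so TensorBoard can compare runs side by side.
run_name = "LeNet-MNIST-1"
log_dir = os.path.join(os.getcwd(), "logs", run_name)

# Create the directory tree up front; exist_ok avoids an error on re-runs.
os.makedirs(log_dir, exist_ok=True)

# Then, in the notebook, pass the absolute path to the callback, e.g.:
# tensor_board = tf.keras.callbacks.TensorBoard(log_dir=log_dir)
print(log_dir)
```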

Posted on 2019-10-11 14:53:00
Donald Kinghorn

Hi Pranab, I finally found the problem! Look at the most recent comment (I made changes in the post too) -- Best wishes --Don

Posted on 2019-10-16 17:11:53
Pranab Das

Thanks :-)

Posted on 2019-10-18 06:01:25
Pranab Das

While other things went well, the utilization of the NVIDIA GPU remained at 0%. Could you throw some light on how to address this?

Posted on 2019-10-30 03:24:23

Where are you looking at GPU utilization? If that is in Task Manager, it is worth noting that the default GPU usage shown there is "3D" (the way that games and other graphics-intensive applications use the GPU). If you want to use CUDA usage instead, go to Task Manager -> Performance tab -> GPU (on the left side) -> select "Cuda" from one of the drop-down menus on the right side.

Posted on 2019-10-30 15:43:21
Pranab Das

Thanks :-)
The problem manifested in different ways. One is what you responded to. The other is the GPU running out of memory when I increase the number of neurons in the deep layers. At least that is how the terse message is interpreted by the TensorFlow community. I thought it should not happen with the NVIDIA GPU kicking in (after following your instructions). That the code is OK was verified by running it in a cloud env.

Posted on 2019-10-30 16:18:20
Donald Kinghorn

Running out of GPU mem is, unfortunately, common! [That's why some folks end up getting a Titan RTX with 24GB, but usually 12GB is OK for lots of work; 2-4GB cards can't really do anything but small models.] One thing to try if you think you are getting an OOM (out of memory) error but shouldn't: sometimes code crashes can cause a memory leak in GPU mem in a Jupyter notebook. You can try shutting down the kernel and notebook and restarting the session.
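For a rough feel of whether a model should fit, here is a back-of-envelope sketch for the weights of a dense-layer stack (float32 weights only -- activations, gradients and optimizer state multiply this several times over during training, so treat it as a lower bound):

```python
def dense_param_bytes(layer_sizes, bytes_per_param=4):
    """Approximate memory for the weights of a stack of dense layers.

    layer_sizes: e.g. [784, 4096, 4096, 10] for input -> hidden -> ... -> output
    bytes_per_param: 4 for float32
    """
    params = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        params += n_in * n_out + n_out  # weight matrix + bias vector
    return params * bytes_per_param

# A wide MLP on MNIST-sized input:
mb = dense_param_bytes([784, 4096, 4096, 10]) / 2**20
print(f"~{mb:.0f} MB just for the weights")
```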

Most cloud GPU instances will start up with a Tesla GPU with at least 16GB (and give your instance exclusive access to the GPU ... that's why they are kind of expensive)

Posted on 2019-10-30 16:39:03
Donald Kinghorn

Thanks William, that's a great tip!

Posted on 2019-10-30 16:28:38
Donald Kinghorn

try;
conda activate tf-gpu
python

import tensorflow as tf

tf.test.is_gpu_available()

tf.test.gpu_device_name()

There are several test functions in tensorflow. Those 2 should show you information about your current tf env

If you have done an env with TF 2.0 (that's now the default if you conda install tensorflow-gpu)
then you could play with this (you can change tf.device("/gpu:0") to tf.device("/cpu:0") to see the difference in time):

import tensorflow as tf
import time

n = 1000
dtype = tf.float32
start = time.time()
with tf.device("/gpu:0"):
    a = tf.Variable(tf.ones((n, n), dtype=dtype))
    b = tf.Variable(tf.ones((n, n), dtype=dtype))
    c = tf.matmul(a, b)
    c_norm = tf.linalg.norm(c)

print('cnorm = ', c_norm)
print('took', time.time() - start, 'sec')

Posted on 2019-10-30 16:19:11
Torben Andersen

I followed your instructions at the beginning of 2019 and the system is still working beautifully (thanks Donald!). I just got a paper published based on my work with the system. I now wish to upgrade to TensorFlow 2 but I am scared that I'll break something if I try to update/upgrade. Has anybody done this, and did it work? Any instructions?

Posted on 2019-10-11 06:28:36
Torben Andersen

My apologies for asking the above question. I thought that I had double-checked that it had not already been dealt with but I was wrong. It's explained in all detail. Again thanks for a wonderful instruction.

Posted on 2019-10-11 10:34:36
Donald Kinghorn

Hi Torben, I was going to add another comment related to this. There is now a problem with what I posted below. The Anaconda folks updated the build of TF 1.14 to be linked with CUDA 10.1. That will fail with the pip install of TF 2.0 because it is linked against CUDA 10.0.

You might be able to do the clone procedure in the post I linked below. But check to see what CUDA version you have in your existing TF env.
From inside your existing TF env,

conda list

will show all the packages that are installed. Look for the cudatoolkit package; if it is version 10.0 then you can do the clone env and pip install of TF 2.0.
If you see version 10.1 then it will not work.
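If you'd rather check this programmatically, a tiny (hypothetical) helper can pull the version out of captured `conda list` output; the sample text below is illustrative, not real output from any particular env:

```python
def cudatoolkit_version(conda_list_text):
    # Scan each line of `conda list` output; the first column is the
    # package name and the second is the version string.
    for line in conda_list_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cudatoolkit":
            return parts[1]
    return None

sample = """cudatoolkit               10.0.130                      0
cudnn                     7.6.0                cuda10.0_0
"""
print(cudatoolkit_version(sample))  # -> 10.0.130
```

You could feed it the output of `conda list` captured with `subprocess.run`, then branch on whether the version starts with "10.0" or "10.1".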

The Anaconda folks will likely have an official TF 2 package built soon ... I just checked and it's not there today.

Basically, something like the following in a new env will work too:
conda install tensorflow-gpu=1.13.1
pip install tensorflow-gpu==2.0

1.13.1 sets up the correct CUDA libs and then pip install for 2.0 will work

Posted on 2019-10-11 14:29:15
夏天成

Thanks for such a helpful article! It really helped me a lot, since plenty of other tutorials I tried on the internet were not useful.
But I ran into a problem when running the code you posted in the article. After executing "Create the LeNet-5 convolution neural network architecture", I got this warning below the cell:

"WARNING:tensorflow:From C:\Users\Jason\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\ops\init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor"

I ignored that, but then my training run failed with an error like this:

Train on 60000 samples, validate on 10000 samples
Epoch 1/15
128/60000 [..............................] - ETA: 8:56 - loss: 2.3053 - acc: 0.0703
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<ipython-input-7-792ff921728a> in <module>
4
5 model.fit(train_images, train_labels, batch_size=128, epochs=15, verbose=1,
----> 6 validation_data=(test_images, test_labels), callbacks=[tensor_board])
7
8 print('Training took {} seconds'.format(time.time()-start_time))

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
778 validation_steps=validation_steps,
779 validation_freq=validation_freq,
--> 780 steps_name='steps_per_epoch')
781
782 def evaluate(self,

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
372 # Callbacks batch end.
373 batch_logs = cbks.make_logs(model, batch_logs, batch_outs, mode)
--> 374 callbacks._call_batch_hook(mode, 'end', batch_index, batch_logs)
375 progbar.on_batch_end(batch_index, batch_logs)
376

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in _call_batch_hook(self, mode, hook, batch, logs)
246 for callback in self.callbacks:
247 batch_hook = getattr(callback, hook_name)
--> 248 batch_hook(batch, logs)
249 self._delta_ts[hook_name].append(time.time() - t_before_callbacks)
250

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks.py in on_train_batch_end(self, batch, logs)
529 """
530 # For backwards compatibility.
--> 531 self.on_batch_end(batch, logs=logs)
532
533 def on_test_batch_begin(self, batch, logs=None):

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\keras\callbacks_v1.py in on_batch_end(self, batch, logs)
360 self._total_batches_seen += 1
361 if self._is_profiling:
--> 362 profiler.save(self.log_dir, profiler.stop())
363 self._is_profiling = False
364 elif (not self._is_profiling and

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\eager\profiler.py in save(logdir, result)
142 logdir, 'plugins', 'profile',
143 datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S'))
--> 144 gfile.MakeDirs(plugin_dir)
145 maybe_create_event_file(logdir)
146 with gfile.Open(os.path.join(plugin_dir, 'local.trace'), 'wb') as f:

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir(dirname)
436 errors.OpError: If the operation fails.
437 """
--> 438 recursive_create_dir_v2(dirname)
439
440

~\.conda\envs\tf-gpu\lib\site-packages\tensorflow\python\lib\io\file_io.py in recursive_create_dir_v2(path)
451 errors.OpError: If the operation fails.
452 """
--> 453 pywrap_tensorflow.RecursivelyCreateDir(compat.as_bytes(path))
454
455

NotFoundError: Failed to create a directory: ./logs/LeNet-MNIST-1\plugins\profile\2019-10-16_13-44-58; No such file or directory

Any idea? Would appreciate your reply!

Posted on 2019-10-16 13:09:43
Donald Kinghorn

This has been coming up more often recently. You can ignore all of the message except that last line :-) What is happening is that TensorBoard is not able to write into the log directory.

I am going to try to reproduce the error and see if I can find what is going on...
**************************
OK Found it!
The older version (1.13.1) was able to use Unix-style file paths on Windows, but it looks like version 1.14 does not.
You need to change this,
tensor_board = tf.keras.callbacks.TensorBoard('./logs/LeNet-MNIST-1')
to this
tensor_board = tf.keras.callbacks.TensorBoard('.\logs\LeNet-MNIST-1')

I also noticed that you no longer need to create the directory beforehand, i.e. if the directory .\logs\LeNet-MNIST-1 doesn't exist when you start the job run, it will be created automatically.
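A more portable alternative (my suggestion, not from the original post) is to build the log path with os.path.join, so Python picks the right separator for whatever OS the notebook is running on:

```python
import os

# Let Python choose the path separator for the current OS instead of
# hard-coding "/" or "\" in the TensorBoard log directory string.
log_dir = os.path.join(".", "logs", "LeNet-MNIST-1")
print(log_dir)
# then pass it to the callback:
# tensor_board = tf.keras.callbacks.TensorBoard(log_dir)
```

This sidesteps the Windows-vs-Unix separator problem entirely, so the same notebook cell works under both TF 1.13.1 and 1.14.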
****************************

I'll add a note to the post text! --dbk

Posted on 2019-10-16 17:09:52
夏天成

Hi Donald! Thanks for such a quick reply! My test session is running and I can finally use the GPU for my study! But I hit a problem when I followed your last step to open TensorBoard: my browser just won't open the address. I tried many fixes posted on the internet, but I still can't get to the TensorBoard page. Does anyone else face the same problem?

Posted on 2019-10-17 15:55:43
Donald Kinghorn

Yes! I just tried this too ... It works fine in Firefox and Edge, but Chrome does not open the page! I use Firefox when I work with Jupyter notebooks, so I didn't see this.

It may be because the local server that TensorBoard runs is using http:// instead of https://, and Chrome is starting to block that!

I tried using localhost:6006 in Chrome and that did work OK. ... I added a note in the post, thanks! :-)

Posted on 2019-10-17 17:49:28
夏天成

Thanks a lot!

Posted on 2019-10-21 13:34:10
107united

Do we need to install Visual Studio as a prerequisite for CUDA and cuDNN?

Posted on 2019-11-11 11:22:03
Donald Kinghorn

No. This is the beauty of using Anaconda: you don't need to install CUDA yourself. The needed libraries are installed along with tensorflow-gpu in the env that you create.
After you install tensorflow-gpu, do

conda list

that will show you all the packages that are in the env. You should see something like,
cudatoolkit 10.0.130 0
cudnn 7.6.0 cuda10.0_0
in the list.

Posted on 2019-11-11 18:13:20