Read this article at https://www.pugetsystems.com/guides/1134
Dr Donald Kinghorn (Scientific Computing Advisor)

Build TensorFlow-GPU with CUDA 9.1 MKL and Anaconda Python 3.6 using a Docker Container

Written on April 12, 2018 by Dr Donald Kinghorn

I've been working a lot with TensorFlow lately. The Google team and outside contributors do a very good job of providing binary packages and fresh docker images for TensorFlow, but they are not really built the way I would prefer. They are, understandably, somewhat conservative with the builds. The new 1.7 release only includes CPU vector unit optimizations up to AVX, which goes back to 2011. The Python in the official builds is "Python.org 3.5" and I prefer Anaconda 3.6. The GPU support is provided by CUDA 9.0 instead of the current 9.1.

Last week I went through how to do a custom build of TensorFlow 1.7 for CPU using a build environment inside a docker container. That build included links to MKL-ML (Intel Math Kernel Library - MKL-DNN). It was also built with Anaconda Python 3.6, which is what I prefer to use for Python. The result was a TensorFlow build that showed a 2.5-fold speedup over the official build. Now I'll go through the same basic process but this time include GPU acceleration with the current CUDA 9.1 and cuDNN 7. We'll make a new docker container with all of the dependencies and configuration needed to do the build. That will avoid having to clutter the host system with the needed build environment.

The GPU acceleration in this build will not see a dramatic improvement like the CPU build did because the CUDA performance from 9.0 is already very good. Moving to 9.1 is just making the build current and a better fit for my development environment. There will be a big improvement for CPU performance and it will be built against the Python that I actually use i.e. Anaconda 3.6.

I will go through the same step-by-step instructions as I did in the CPU build but will include the necessary changes and details to get a GPU accelerated build. I recommend that you read through the CPU build post before you try this.

Note: I did have some difficulties with this build. It looks as though the Bazel build tool (or its configuration) had some problems. It would "forget" about library paths at various points during linking. I had to do a few "hacks" to get it to work right. Hopefully this will save you some grief if you are running into the same kinds of issues.

Note: Another bazel problem! While I was working on this post bazel was updated from 0.11.1 to 0.12.0 and that broke the TensorFlow build (bazel is also a Google project). In this post I explicitly use bazel 0.11.1 and have instructions on how to get and install it.


Step-by-Step Instructions

1) If needed, follow my guide to install and configure Docker

2) Download the sources for TensorFlow

  • Make a directory to do the build,
mkdir TF-build-gpu
cd TF-build-gpu
  • Get the TensorFlow source tree and "checkout" the branch you want,
git clone https://github.com/tensorflow/tensorflow
cd tensorflow/
git checkout r1.7

3) Setup the docker container build directory

  • From the TF-build-gpu directory create a directory for your Dockerfile and some other files we will copy into the container.
mkdir dockerfile
cd dockerfile

4) Get the Anaconda3 install shell archive file and the bazel 0.11.1 deb file, and put both in the dockerfile directory
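
The post does not say where to download these from. The sketch below is one way to fetch them with curl. Both URLs are assumptions based on the file names used later in the Dockerfile (the Anaconda archive site and the bazel GitHub releases page), and `fetch` is a hypothetical helper, so check the links before relying on them.

```shell
# Assumed download locations (not given in the post); verify before use.
ANACONDA_URL="https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh"
BAZEL_URL="https://github.com/bazelbuild/bazel/releases/download/0.11.1/bazel_0.11.1-linux-x86_64.deb"

# Hypothetical helper: download a file into the current directory
# unless a file with that name is already there.
fetch() {
    local url="$1"
    local file="${url##*/}"    # strip everything up to the last "/"
    if [ -f "$file" ]; then
        echo "already have $file"
    else
        curl -fL -o "$file" "$url"
    fi
}

# Run from the dockerfile directory:
#   fetch "$ANACONDA_URL"
#   fetch "$BAZEL_URL"
```

Skipping files that already exist makes it safe to rerun the commands if one of the downloads fails part way through the setup.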

5) Create files called cuda.sh, cuda.conf in the dockerfile directory

These files will be copied into the docker image we will be constructing.

cuda.sh

This will set the PATH environment for cuda in the container.

export PATH=$PATH:/usr/local/cuda/bin

cuda.conf

This will add the cuda libraries to the default library path. The "stubs" directory is one of the hacks that I needed to keep the library paths working during this build.

/usr/local/cuda/lib64
/usr/local/cuda/extras/CUPTI/lib64
/usr/local/cuda/targets/x86_64-linux/lib/stubs
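
If you would rather not open an editor, both files can be written from the shell. This is a small sketch using here-documents, run from the dockerfile directory. The contents are exactly the lines shown above.

```shell
# Write cuda.sh. The quoted 'EOF' keeps $PATH literal, so it is expanded
# later when the file is sourced inside the container, not on the host now.
cat > cuda.sh <<'EOF'
export PATH=$PATH:/usr/local/cuda/bin
EOF

# Write cuda.conf, one library directory per line.
cat > cuda.conf <<'EOF'
/usr/local/cuda/lib64
/usr/local/cuda/extras/CUPTI/lib64
/usr/local/cuda/targets/x86_64-linux/lib/stubs
EOF
```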

6) Create the Dockerfile to build the container

  • Put the following in a file named Dockerfile in the dockerfile directory (note the capital "D" in the file name)
# Dockerfile to setup a build environment for TensorFlow
# using Intel MKL and Anaconda3 Python
# GPU support with CUDA 9.1 and cudnn7.1

FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04

MAINTAINER nobody, not even me

# Add a few needed packages to the base Ubuntu 16.04
# OK, maybe *you* don't need emacs :-)
RUN \
    apt-get update && apt-get install -y \
    build-essential \
    curl \
    emacs-nox \
    git \
    openjdk-8-jdk \
    && rm -rf /var/lib/apt/lists/*

# Use bazel version 0.11.1! Install from the deb file.
COPY bazel_0.11.1-linux-x86_64.deb /root/
RUN \
  cd /root; dpkg -i bazel_0.11.1-linux-x86_64.deb && \
  rm -f bazel_0.11.1-linux-x86_64.deb

# Copy in and install Anaconda3 from the shell archive
# Anaconda3-5.1.0-Linux-x86_64.sh
COPY Anaconda3* /root/
RUN \
  cd /root; chmod 755 Anaconda3*.sh && \
  ./Anaconda3*.sh -b && \
  echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> .bashrc && \
  rm -f Anaconda3*.sh

# Copy in the CUDA configuration files
COPY cuda.sh /etc/profile.d/
COPY cuda.conf /etc/ld.so.conf.d/

# That's it! That should be enough to do a TensorFlow 1.7 GPU build
# using CUDA 9.1 Anaconda Python 3.6 Intel MKL with gcc 5.4

This Dockerfile will,

  • use the official NVIDIA CUDA 9.1 image on an Ubuntu 16.04 base
  • install some needed packages
  • install bazel 0.11.1 from the downloaded deb file
  • Install Anaconda3 Python
  • Add the configuration files needed for CUDA 9.1

7) Create the container

docker build -t tf-build-1.7-gpu .

That will create the container image we will do the TensorFlow build in. This is a large image! It will take a while to build and install everything.
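
Since the bazel version matters so much for this build, it may be worth confirming what ended up in the image before going further. The sketch below assumes that `bazel version` prints a line of the form "Build label: 0.11.1". `bazel_is_version` is a hypothetical helper name, not part of bazel.

```shell
# Hypothetical helper: succeeds only if the `bazel version` text on stdin
# contains the exact line "Build label: <wanted version>".
bazel_is_version() {
    local want="$1"
    grep -qxF "Build label: ${want}"
}

# Usage (assumes the image tag from step 7):
#   docker run --rm tf-build-1.7-gpu bazel version | bazel_is_version 0.11.1 \
#       && echo "bazel OK" || echo "unexpected bazel version"
```

The fixed-string, whole-line match (`-xF`) avoids treating the dots in the version number as regular expression wildcards.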

8) Start the container and bind the directory with the source tree

docker run --runtime=nvidia --rm -it -v $HOME/projects/TF-build-gpu:/root/TF-build-gpu tf-build-1.7-gpu

That will start the container. Note that I have my directory for the build in $HOME/projects/TF-build-gpu and that is being bound into the container at /root/TF-build-gpu.

9) Configure TensorFlow build

Now that you are in the container,

cd /root/TF-build-gpu/tensorflow/

./configure

./configure will ask a lot of questions. It should see Anaconda Python 3.6 as the system Python and use that. You will probably want to answer "No" to most of the questions. Answer "Yes" to GPU support since we set up CUDA in this container. I set the CUDA version to 9.1 and included compute capabilities from 5.2 to 7.0. That includes GPUs from Maxwell to the current V100 (Titan V). You could add support for older cards. You can find a list of compute capabilities on the CUDA Wikipedia page.
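
As a quick reference for the compute capability list, here is a small sketch that maps a capability number to its GPU architecture family. The function name `cc_arch` is made up for illustration, and the table only covers the families relevant to this build.

```shell
# Hypothetical helper: map a CUDA compute capability to its architecture family.
cc_arch() {
    case "$1" in
        3.*) echo "Kepler" ;;
        5.*) echo "Maxwell" ;;
        6.*) echo "Pascal" ;;
        7.0) echo "Volta" ;;
        *)   echo "unknown" ;;
    esac
}

# The capabilities used for this build
for cc in 5.2 6.0 6.1 7.0; do
    echo "$cc $(cc_arch "$cc")"
done
```

Adding 3.5 to the list given to configure would bring in the older Kepler cards, at the cost of a longer build and a larger binary.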

Here are my answers to configure,

root@dc40c84fcef1:~/TF-build/tensorflow# ./configure
/root/anaconda3/bin/python
.
Extracting Bazel installation...
You have bazel 0.11.1 installed.
Please specify the location of python. [Default is /root/anaconda3/bin/python]:


Found possible Python library paths:
 /root/anaconda3/lib/python3.6/site-packages
Please input the desired Python library path to use.  Default is [/root/anaconda3/lib/python3.6/site-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Apache Kafka Platform support? [y/N]: n
No Apache Kafka Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1


Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:7.1


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2]5.2,6.0,6.1,7.0


Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:


Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:


Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
   --config=mkl         	# Build with MKL support.
   --config=monolithic  	# Config for mostly static monolithic build.
Configuration finished

I kept -march=native since on my machine that will give me AVX512 and FMA3.
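
What -march=native gives you depends on the CPU doing the build. The following sketch checks a few vector flags by reading /proc/cpuinfo on Linux. The flag names (avx, avx2, fma, avx512f) are the ones the kernel reports, and `has_cpu_flag` is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the space-separated flag list ($2)
# contains the flag ($1) as a whole word.
has_cpu_flag() {
    local flag="$1" flags="$2"
    case " $flags " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On Linux, take the flag list of the first CPU listed in /proc/cpuinfo.
cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
for f in avx avx2 fma avx512f; do
    if has_cpu_flag "$f" "$cpu_flags"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

Keep in mind that a package built with -march=native on an AVX512 machine may not run on a CPU that lacks those instructions.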

10) Build TensorFlow

After you are finished with configure you can do the build. I used,

bazel build --config=opt --config=mkl --config=cuda --action_env PATH="$PATH"  //tensorflow/tools/pip_package:build_pip_package

Note that in addition to "opt" and "cuda" I used --config=mkl that will cause the build to link in the Intel MKL-ML libs. Those libs are now included in the TensorFlow source tree. (Thank you Intel for making those important libraries available to the public.)

Also note, I had to add --action_env PATH="$PATH" because bazel sometimes forgets its environment!

It will take some time to build since TensorFlow is a big package! I was greeted with the wonderful message,

INFO: Elapsed time: 891.344s, Critical Path: 501.24s
INFO: Build completed successfully, 6853 total actions

11) Create the pip package

After your build finishes you will want to create the pip package,

bazel-bin/tensorflow/tools/pip_package/build_pip_package ../tensorflow_pkg

You should now have a "whl" file in your TF-build-gpu/tensorflow_pkg directory. You can install that pip package in a conda environment on your local machine if you have Anaconda installed there. This is what I was planning on for the build. I also want this pip package for use in other Docker containers.
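
The name of the whl file tells you what it was built for. As a sketch, assuming a file name like the one installed in step 12, the dash-separated fields can be pulled apart in the shell:

```shell
# Split a wheel file name into its dash-separated fields.
whl="tensorflow-1.7.0-cp36-cp36m-linux_x86_64.whl"
IFS=- read -r name version pytag abitag platform <<EOF
${whl%.whl}
EOF
echo "package=$name version=$version python=$pytag abi=$abitag platform=$platform"
# prints: package=tensorflow version=1.7.0 python=cp36 abi=cp36m platform=linux_x86_64
```

The cp36 tag means CPython 3.6, so pip will only install this package into a Python 3.6 environment such as the Anaconda one used here.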

12) Install the pip package

I'll test in the current Docker container. First create a conda env.

conda create -n tftest

source activate tftest

pip install ../tensorflow_pkg/tensorflow-1.7.0-cp36-cp36m-linux_x86_64.whl

The first thing I want to see is the linked in libraries. I used "ldd" to check that,

ldd ~/anaconda3/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so
	linux-vdso.so.1 =>  (0x00007fff5e9fe000)
	libcublas.so.9.1 => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcublas.so.9.1 (0x00007f7687f0c000)
	libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f768736c000)
	libcudnn.so.7 => /usr/lib/x86_64-linux-gnu/libcudnn.so.7 (0x00007f7672b43000)
	libcufft.so.9.1 => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcufft.so.9.1 (0x00007f766b656000)
	libcurand.so.9.1 => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcurand.so.9.1 (0x00007f76676d3000)
	libcudart.so.9.1 => /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcudart.so.9.1 (0x00007f7667465000)
	libmklml_intel.so => /root/anaconda3/lib/python3.6/site-packages/tensorflow/../_solib_local/_U_S_Sthird_Uparty_Smkl_Cintel_Ubinary_Ublob___Uexternal_Smkl_Slib/libmklml_intel.so (0x00007f765e622000)
	libiomp5.so => /root/anaconda3/lib/python3.6/site-packages/tensorflow/../_solib_local/_U_S_Sthird_Uparty_Smkl_Cintel_Ubinary_Ublob___Uexternal_Smkl_Slib/libiomp5.so (0x00007f765e27e000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f765e07a000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f765dd71000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f765db54000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f765d7d2000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f765d5bc000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f765d1f2000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f768caa4000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f765cfea000)
	libnvidia-fatbinaryloader.so.390.48 => /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.48 (0x00007f765cd9e000)

Success! Linked to all of the libraries I wanted.
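
Reading through the full ldd listing by eye is error prone. The sketch below is one way to automate the check. `check_linked` is a hypothetical helper that reads ldd output on stdin and reports any named library that is absent or shown as "not found".

```shell
# Hypothetical helper: read ldd output on stdin and report each wanted
# library that is absent or unresolved. Returns non-zero if any is missing.
check_linked() {
    local out missing=0 lib
    out=$(cat)
    for lib in "$@"; do
        # A library counts as linked if some line names it without "not found".
        if ! printf '%s\n' "$out" | grep -F "$lib" | grep -qv 'not found'; then
            echo "missing: $lib"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   ldd ~/anaconda3/lib/python3.6/site-packages/tensorflow/libtensorflow_framework.so \
#       | check_linked libcublas.so.9.1 libcudnn.so.7 libmklml_intel.so libiomp5.so
```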

Happy computing! --dbk

Tags: TensorFlow, CUDA, Intel MKL, Anaconda Python, Docker
lowe3

Thank you for an excellent series of posts. I have successfully worked my way through the first 4 pages and am now at step 10 where I receive the following error:

'... //tensorflow/tools/pip_package:build_pip_package failed to build...'

with a mention of

'Xbyak::Error'

Does this mean anything to you?

Posted on 2018-04-24 20:45:56
Donald Kinghorn

TensorFlow was one of the flakier builds that I've done in a long time. I think most of the problems people have come from Bazel. Be certain to check your bazel version. 0.11.1 worked but 0.12 failed. I did comment on a thread on github about it. That thread got pretty long. I would suggest you first look at that

https://github.com/tensorfl...

It could be something else too! In fact, I just took a look and this might be coming from MKL-DNN. I've never seen it that I know of. I would try to build the package without --config=mkl and see if it works OK [are you using a Ryzen CPU? I see some people were having trouble with that together with MKL-DNN]

Yes, it looks like this is called/linked in MKL-DNN mkl_dnn/src/cpu/xbyak

My advice would be to leave out the MKL config, i.e. not use --config=mkl. If you are running on an NVIDIA GPU the CUDA cuDNN is going to provide most of your performance. I feel that it is nice to have MKL linked in for times when you want to just run on CPU but the GPU is generally going to be faster.

... still, it should build ... I saw there were some problems building CNTK with the same kind of issue. The CNTK maintainers fixed it by updating the libs they included in their source. If you want to try an experiment you could see if TF 1.8 will build correctly in the container. It looks like it will build with bazel 0.12 also ...

Posted on 2018-04-25 15:26:40
lowe3

Donald,

Building without the --config=mkl flag in step #10 worked!

Yes I am using a Ryzen CPU (Threadripper 1950x).

Thank you again for your time.
-Tripp

Posted on 2018-04-25 17:29:13
Saiteja Dommeti

Hello Donald Kinghorn, I was recently trying to build TensorFlow for my RTX 2080 with CUDA 10 and cuDNN 7 but there appears a "hash mismatch" which doesn't let the docker build pass. I have tried almost all of the solutions and fixes available, only to be disappointed.

"Failed to fetch http://archive.ubuntu.com/u... Hash Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead."

This is the error I hit during the "docker build -t tf-build-1.7-gpu ." step. Could you shed some light?

Thanks in advance !

Posted on 2018-10-06 07:36:18
Donald Kinghorn

I haven't tried this yet myself. The hash mismatch sounds like a key error ... maybe?? I have been using the TF 1.10 container built against CUDA 10.0 that is on NVIDIA's NGC. My local system is Ubuntu 18.04 with the NV driver 410.

My best advice at this point is to do what I would do to start with. Look at the Dockerfile sources that are on dockerhub for cuda
https://hub.docker.com/r/nv...

The first thing is to get a new Dockerfile put together to build your container, and for that you need to decide which base to start from. Just decide on what you want for the first FROM line and then take one small step at a time, adding things in and testing. You will probably want to use the latest TF source and the latest Bazel, and you will have to have a new NVIDIA driver on your host machine. CUDA 10 requires the NVIDIA display driver 410 or greater. At this point the 410 version is not in the graphics-drivers ppa

Actually, the first thing is to have your host system up-to-date with a new driver, and you will need nvidia-docker2 also ... Right now the easiest way to do the 410 driver install is to do a CUDA 10 install since it comes with the 410 driver.

It might be best to wait a little before you try to do this (if you can get away with waiting). I will probably wait until the 410 (or greater) driver is in the ppa's. It's beta right now. In the meantime you may be able to use a container image from the NGC registry. See my recent post https://www.pugetsystems.co... --Don

Posted on 2018-10-08 19:02:27
Saiteja Dommeti

Thank you Mr. Donald, I did try that and still end up with the hash mismatch.

Posted on 2018-10-08 19:08:16