Read this article at https://www.pugetsystems.com/guides/1122
Dr Donald Kinghorn (Scientific Computing Advisor)

TensorFlow Scaling on 8 1080Ti GPUs - Billion Words Benchmark with LSTM on a Docker Workstation Configuration

Written on March 2, 2018 by Dr Donald Kinghorn

I do most of my system testing now using Docker, and I just finished some work on a system that had 8 NVIDIA 1080Ti GPUs in it. This will be an example usage of the Docker configuration I described in the series of posts "How-To Setup NVIDIA Docker and NGC Registry on your Workstation - Parts 1-5" (I'll give links to those posts below).

I'll be using the TensorFlow image from the NVIDIA NGC docker registry to look at the multi-GPU scaling of TensorFlow. In particular I'll be running BIGLSTM for the "One Billion Words Benchmark" (see the paper Exploring the Limits of Language Modeling). The NVIDIA NGC docker image for TensorFlow includes a very nice implementation of this benchmark that was done by Oleksii Kuchaiev, one of the excellent machine learning researchers at NVIDIA. The source code for his work is on this GitHub page.

OS Install, Docker, NVIDIA Docker and NGC Configuration

The system configuration was a docker workstation setup as described in the following posts,

These posts provide very detailed explanations of how I set up Docker for a workstation. The procedures in those posts can be mostly scripted, and the entire install and configuration can typically be completed in less than an hour, including the time for the desktop OS packages to download.


The relevant components of the system under test were as follows,

  • Motherboard: TYAN S7109GM2NR-2T [dual root complex with 4 PLX PEX8747 PCIe switches] in chassis B7109F77DV14HR-2T-N
  • CPUs: 2 x Intel Xeon Scalable Platinum 8180 @ 2.50GHz, 28-core
  • Memory: 768GB DDR4 REG ECC (12 x 64GB, 2666MHz)
  • GPUs: 8 x NVIDIA 1080Ti

Maybe I could get away with calling this a super workstation on lots of steroids!

Running the Billion Words Benchmark

My post with the subtitle "Part 4 Accessing the NGC Registry" gives information about getting access to the registry. For this testing I used a container instance of the TensorFlow image from NGC. The image tag was "18.01-py2". Yes, Python 2; I had trouble getting the benchmark to work correctly with the Python 3 image. There is a directory in that image, nvidia-examples/big_lstm, that has the benchmark scripts. There is also a convenient script for downloading the dataset.

I start the container with the following command,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.01-py2

My home/projects directory is bound to /projects in the container. I copy the nvidia-examples/big_lstm directory there and run the download_1b_words_data.sh script to get the dataset, which is around 4GB in size. This way the scripts and dataset live on my host filesystem rather than just in the container. There is a good README file in the big_lstm directory that you should read if you are going to run this yourself.

LSTM models can take a very long time to train (weeks) so I have the environment variable MAX_TRAIN_TIME_SEC=240 set for the benchmark run.

The command to run the benchmark is,

python single_lm_train.py --mode=train --logdir=/logs --num_gpus=8 \
    --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig

Note: the trailing backslash continues the command onto the next line, so it can be copied and pasted as shown.

The settings that were changed for the job runs were,

  • --num_gpus=8: varied over 1, 2, 4, 6, 8
  • --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/: where I downloaded the dataset.
  • batch_size=448: the batch size listed for the run command in the README file was 512, but that turned out to be too large for the limited memory on the 1080Ti GPUs. 448 was as large as I could use without memory overruns. Performance did increase as I varied the batch size from 256 to 448.

The value I collect from these "training" runs as the benchmark is wps, "words per second", taking the maximum value reached during the job's iterations as the reported number.
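Extracting that maximum wps value from the training output can be scripted. This is just a sketch: the exact log line format ("wps = <number>") is an assumption on my part and may differ between image versions.

```python
import re

def max_wps(log_text):
    """Return the largest words-per-second value found in training output.

    Assumes the trainer prints lines containing 'wps = <number>';
    the exact format may vary, so adjust the regex to match your logs.
    """
    values = [float(m) for m in re.findall(r"wps\s*=\s*([\d.]+)", log_text)]
    return max(values) if values else None

# Hypothetical sample of what the training output might look like
sample = """Iteration 100, time = 12.3s, wps = 5521.0, train loss = 7.91
Iteration 200, time = 12.1s, wps = 5884.0, train loss = 7.64"""

print(max_wps(sample))  # prints the highest wps seen: 5884.0
```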


The results for this system are very good, better than expected.

There is clearly an advantage to using multiple GPUs for this machine learning task.

I did not record or report the "perplexity" (the loss function) which is the typical measure of "goodness of fit" for this type of model. It would likely take weeks of training time to get a low perplexity even with the incredibly good performance of the system I'm testing on.

Billion Words Benchmark model training, LSTM with TensorFlow on 1-8 1080Ti GPUs

Number of GPUs | Words Per Second | Performance Increase | % Efficiency
1              | 5884             | 1                    | 100%
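For reference, the "Performance Increase" and "% Efficiency" columns are derived from the measured wps values like this (a minimal sketch; the 1-GPU baseline comes from the table above, and the multi-GPU wps values would come from your own runs):

```python
def scaling_stats(baseline_wps, wps, num_gpus):
    """Speedup ('Performance Increase') and parallel efficiency (%)
    of a multi-GPU run relative to the single-GPU baseline."""
    speedup = wps / baseline_wps
    efficiency = 100.0 * speedup / num_gpus
    return speedup, efficiency

baseline = 5884.0  # 1-GPU wps from the table
s, e = scaling_stats(baseline, baseline, 1)
print(s, e)  # 1.0 100.0 for the baseline row, by construction
```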

The scaling for two GPUs is near perfect, and scaling remains fairly good out to 6 GPUs.

Fitting the data to an Amdahl's Law curve gives a parallel fraction of P = 0.935. [That means the maximum speedup achievable is unlikely to exceed 1/(1-P) = 15.3 with any number of GPUs in the system.]

Here's the expression of Amdahl's Law that I fit the data to,

performance_wps = 5584 / ((1 - P) + P/num_GPUs)
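The fitted model is easy to evaluate directly. A minimal sketch, using the P = 0.935 parallel fraction and the 5584 wps single-GPU baseline from the expression above:

```python
def amdahl_wps(num_gpus, p=0.935, baseline_wps=5584.0):
    """Predicted words-per-second from Amdahl's Law, given the
    parallel fraction p and the measured single-GPU baseline."""
    return baseline_wps / ((1.0 - p) + p / num_gpus)

# The speedup limit as num_gpus grows without bound is 1/(1-P)
max_speedup = 1.0 / (1.0 - 0.935)

for n in (1, 2, 4, 6, 8):
    print(n, round(amdahl_wps(n)))
print("max speedup:", round(max_speedup, 1))
```

With P = 0.935 the predicted curve passes through the 1-GPU baseline exactly and flattens toward the 1/(1-P) limit as more GPUs are added.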

Below is a plot of that curve, along with the measured "words per second" from the model training runs and a line representing "perfect" linear scaling.

[Plot: GPUs vs Words Per Second]

I think the table and plot speak for themselves!

Happy computing --dbk

Tags: Docker, TensorFlow, NVIDIA, Linux, NGC

Dr. Kinghorn, thanks for the review.
I do have to ask, 2x Xeon 8180's, 768GB of RAM, and eight GTX 1080's, do you guys have any plans to test something with a bit more computing power than contained in a typical residential neighborhood?

Posted on 2018-03-03 03:30:32
Donald Kinghorn

You are welcome! That was a fun system to work on!
Your question is interesting because I have worked on big clusters that required more "power than contained in a typical residential neighborhood", but I don't think that's what you are really asking :-) The box was at first plugged into a cheap 15 Amp power strip, and since I was working remotely, sure enough it tripped when I ran the harder jobs. The guys plugged it into two lines after that and it was OK. ... 8 GPUs can pull a lot of power under load!

For our rack gear we generally assume that it is going somewhere with facilities to handle the load. The system I was testing would have been more "efficient" on a 240V or 208V line, and the power supplies were capable of that.

For a residential 15-20A 110V line, a dual-socket system with 4 GPUs is usually OK, but it's getting close to the point where you start wanting to ask "what else is going to be running on that line?" I overloaded a 1500VA UPS (~900W) with a 4-GPU box the other day. It stayed up, but the UPS battery was not able to supply all the current and it tripped the load alarm.

I have a system design limit that I always like to keep in mind: "how much compute capability can you get on a 20 Amp line?" That is usually somewhere around 4-6 dual-socket CPU-based nodes, or maybe just one 4x GPU system. Going over a kilowatt on one line is iffy ... and you have to remember that you have to get rid of that heat too, i.e. power use for room cooling ...

Posted on 2018-03-19 16:45:22