TensorFlow Scaling on 8 1080Ti GPUs – Billion Words Benchmark with LSTM on a Docker Workstation Configuration



I do most of my system testing now using docker, and I just finished some work on a system that had 8 NVIDIA 1080Ti GPUs in it. This will be an example usage of the docker configuration I described in the series of posts “How-To Setup NVIDIA Docker and NGC Registry on your Workstation – Parts 1-5” (referenced again in the configuration section below).

I’ll be using the TensorFlow image from the NVIDIA NGC docker registry to look at the multi-GPU scaling of TensorFlow. In particular I’ll be running BIGLSTM for the “One Billion Words Benchmark” (see the paper Exploring the Limits of Language Modeling). The NVIDIA NGC docker image for TensorFlow includes a very nice implementation of this benchmark that was done by Oleksii Kuchaiev, one of the excellent machine learning researchers at NVIDIA. The source code for his work is on this GitHub page.


OS Install, Docker, NVIDIA Docker and NGC Configuration

The system configuration was a docker workstation setup as described in my “How-To Setup NVIDIA Docker and NGC Registry on your Workstation” series of posts (Parts 1-5).

Those posts provide very detailed explanations of how I set up Docker for a workstation. The procedures in those posts can be mostly scripted, and the entire install and configuration can typically be completed in less than an hour, including the time for the desktop OS packages to download.
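If you have that setup in place, a quick sanity check that the NVIDIA container runtime can see all of the GPUs is to run nvidia-smi from a stock CUDA container. This is just a suggested check, using the nvidia/cuda image from Docker Hub (the tag below is only an example),

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi    # should list all 8 GPUs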


Hardware

The relevant components of the system under test were as follows,

  • Motherboard: TYAN S7109GM2NR-2T [dual root complex with 4 PLX PEX8747 PCIe switches] in chassis B7109F77DV14HR-2T-N (see the topology check sketch after this list)
  • CPUs: 2 x Intel Xeon Scalable Platinum 8180 @ 2.50GHz, 28-core
  • Memory: 768GB DDR4 REG ECC (12 x 64GB 2666MHz)
  • GPUs: 8 x NVIDIA 1080Ti
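With GPUs hanging off PLX switches on two root complexes it can be worth confirming how the GPUs are connected to each other before benchmarking. This is a suggested check rather than something from the test runs themselves,

nvidia-smi topo -m    # prints the GPU-to-GPU PCIe connection matrix (PIX, PXB, PHB, NODE, SYS)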

Maybe I could get away with calling this a super workstation on lots of steroids!


Running the Billion Words Benchmark

My post with the subtitle “Part 4 Accessing the NGC Registry” gives information about getting access to the registry. For this testing I used a container instance of the TensorFlow image from NGC. The image tag was “18.01-py2”. Yes, Python 2. I had trouble getting the benchmark to work correctly with the Python 3 image. There is a directory in that image, nvidia-examples/big_lstm, that has the benchmark scripts. There is a convenient script for downloading the dataset.
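Once you have an NGC account and API key (see the Part 4 post), pulling the image looks something like the following. The login user name is literally $oauthtoken and the password is your NGC API key,

docker login nvcr.io    # Username: $oauthtoken   Password: <your NGC API key>
docker pull nvcr.io/nvidia/tensorflow:18.01-py2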

I start the container with the following command,

docker run --runtime=nvidia --rm -it -v $HOME/projects:/projects nvcr.io/nvidia/tensorflow:18.01-py2

My $HOME/projects directory is bound to /projects in the container. I copy the nvidia-examples/big_lstm directory there and run the download_1b_words_data.sh script to get the dataset, which is around 4GB in size. This way the scripts and dataset are on my host filesystem rather than just in the container. There is a good README file in the big_lstm directory that you should read if you are going to run this yourself.
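Inside the running container that sequence looks roughly like this. The location of nvidia-examples relative to the container's default working directory is an assumption here, so adjust the copy path if it differs on your image, and check the README for the exact download invocation,

cp -r nvidia-examples/big_lstm /projects/
cd /projects/big_lstm
./download_1b_words_data.sh    # downloads the ~4GB dataset (see the README for options)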

LSTM models can take a very long time to train (weeks), so I have the environment variable MAX_TRAIN_TIME_SEC=240 set for the benchmark runs.
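That is, before launching the training script,

export MAX_TRAIN_TIME_SEC=240    # cap each benchmark run at 240 seconds of training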

The command to run the benchmark is,

python single_lm_train.py --mode=train --logdir=/logs --num_gpus=8 \
  --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
  --hpconfig run_profiler=False,max_time=${MAX_TRAIN_TIME_SEC},num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448

Note: the backslashes are shell line continuations so the command displays reasonably in this post. The --hpconfig value has to be a single argument with no spaces, so keep it on one line.

The settings that were changed for the job runs were,

  • --num_gpus=8 : varied over 1, 2, 4, 6, 8 (see the loop sketch after this list)
  • --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ : where I downloaded the dataset.
  • batch_size=448 : the batch size listed for the run command in the README file was 512, but that turned out to be too large for the limited memory on the 1080Ti GPUs. 448 was as large as I could go without memory overruns. Performance did increase as I varied the batch size from 256 up to 448.
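Here is a rough sketch of how those runs can be scripted. The per-run log file names and the separate --logdir per GPU count are my own additions (a fresh logdir avoids accidentally resuming from a previous run's checkpoints),

# HP holds the --hpconfig string from the command above (batch_size fixed at 448)
HP="run_profiler=False,max_time=${MAX_TRAIN_TIME_SEC},num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=448"

for n in 1 2 4 6 8; do
    python single_lm_train.py --mode=train --logdir=/logs/lstm_${n}gpu --num_gpus=$n \
        --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ \
        --hpconfig "$HP" 2>&1 | tee train_${n}gpu.log
done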

The value I collect from the “training” runs as the benchmark number is wps, “words per second”, taking the maximum value reported during the iterations of each job run.
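With per-run log files like the ones from the loop above, pulling out that maximum is a one-liner. This assumes the training output contains a field of the form “wps = <number>”; adjust the pattern if the log format differs,

grep -oE 'wps = [0-9.]+' train_8gpu.log | awk '{ if ($3+0 > max) max = $3+0 } END { print "max wps:", max }'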


Results

The results for this system are very good. Better than expected.

There is clearly an advantage to using multiple GPUs for this machine learning task.

I did not record or report the “perplexity” (the loss function) which is the typical measure of “goodness of fit” for this type of model. It would likely take weeks of training time to get a low perplexity even with the incredibly good performance of the system I’m testing on.

Billion Words Benchmark model training, LSTM with TensorFlow on 1-8 1080Ti GPUs

Number of GPUs    Words Per Second    Performance Increase    Efficiency
      1                 5884                 1.00                 100%
      2                11364                 1.93                 96.7%
      4                19485                 3.31                 82.8%
      6                27871                 4.74                 78.9%
      8                31528                 5.36                 67.0%

The scaling for two GPUs is nearly perfect, and scaling remains fairly good out to 6 GPUs.

Fitting the data to an Amdahl's Law curve gives a parallel fraction of P = 0.935. [That means the maximum speedup achievable is unlikely to exceed 1/(1-P) = 15.3 with any number of GPUs in the system.]

Here's the expression of Amdahl's Law that I did a regression fit of the data to, with the single-GPU result (in words per second) as the baseline,

performance_wps = 5884 / ( (1 - P) + P/num_GPUs )
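As a quick sanity check on that fit, here is a small sketch (not part of the original runs) that evaluates the fitted curve at each tested GPU count so it can be compared against the measured wps values in the table above,

for n in 1 2 4 6 8; do
    awk -v n=$n 'BEGIN { P = 0.935; printf "%d GPUs: predicted %.0f wps\n", n, 5884 / ((1 - P) + P / n) }'
done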

Below is a plot of that curve, the measured “words per second” from the model training runs and a line that would represent “perfect” linear scaling.

[Plot: Number of GPUs vs Words Per Second]

I think the table and plot speak for themselves!

Happy computing –dbk