Dr Donald Kinghorn (Scientific Computing Advisor)

PCIe X16 vs X8 with 4 x Titan V GPUs for Machine Learning

Written on May 21, 2018 by Dr Donald Kinghorn


PCIe X16 vs PCIe X8 when doing GPU accelerated computing ... How much difference does it make? I get asked that question a lot. The answer, of course, is "it depends". A slightly better answer is that it probably won't have much effect for most real-world applications.

In this post I'll be looking at job runs for some common Machine Learning (Deep Learning) applications on a system with 4 Titan V GPUs. I will repeat the job runs at PCIe X16 and X8.

I looked at this issue a couple of years ago and wrote it up in this post, PCIe X16 vs X8 for GPUs when running cuDNN and Caffe. I've been doing a lot more Machine Learning work since then and the question of PCIe bandwidth limitations keeps coming up. The machines I run jobs on always have their GPUs running at X16, either from lanes provided by the CPU and chipset or with the help of PLX (PEX) PCIe switches. On current Intel single-socket systems a PCIe switch is required to get 4 full X16 slots for GPUs. The alternative for 4 GPUs on boards that don't have PCIe switches is to run the cards at X8 (assuming the motherboard has enough slots to support that). The question is how much performance do you lose by doing that? Let's find out.
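As a point of reference, PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so the theoretical peak works out to roughly 15.8 GB/s at X16 versus 7.9 GB/s at X8. A quick back-of-the-envelope check (not a measurement):

# Theoretical PCIe 3.0 bandwidth: 8 GT/s per lane with 128b/130b encoding.
per_lane = 8e9 * (128.0 / 130.0) / 8      # bytes per second per lane
print('X16: %.1f GB/s   X8: %.1f GB/s' % (16 * per_lane / 1e9, 8 * per_lane / 1e9))
# X16: 15.8 GB/s   X8: 7.9 GB/s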


System configuration

The system I'm using for this testing is the same system I've used recently for tests with Titan V. For more details about the configuration and setup, see the recent post on multi-GPU scaling, Multi-GPU scaling with Titan V and TensorFlow on a 4 GPU Workstation.

Hardware

System under test:

  • Gigabyte motherboard with 4 X16 PCIe sockets (1 PLX switch on sockets 2,3)
  • Intel Xeon W-2195 18 core (Skylake-W with AVX512)
  • 256GB Reg ECC memory (up to 512GB)
  • 4 x NVIDIA Titan V GPUs
  • Samsung 256GB NVMe M.2

Software

  • Ubuntu 16.04
  • Docker 18.03.0-ce
  • NVIDIA Docker V2
  • TensorFlow 1.7 (running on NVIDIA NGC docker image)
  • Keras 2.1.5 (with TensorFlow back-end from local install with Anaconda Python).

"Sticky Note" method for switching from X16 to X8

In order to keep everything else about the system under test the same, the PCIe bandwidth was physically changed on the GPUs by blocking off half of the pins on the cards with cut-down "sticky notes". Don't try this at home! This is equivalent to putting the cards into X8 slots. The following photo shows 4 Titan V GPUs with half of the pins blocked off.

[Photo: 4 Titan V GPUs with half of the PCIe pins blocked off with sticky notes]

Test Jobs

Performance was tested with a mixture of benchmark jobs, a "real-world" model training job and, for completeness, a peer-to-peer bandwidth and latency test.

GoogLeNet Training

This job was run using the TensorFlow docker image on NVIDIA NGC. The application is cnn in the nvidia-examples directory, and I am using synthetic data for image input. For details on this job run see NVIDIA Titan V plus Tensor-cores Considerations and Testing of FP16 for Deep Learning. [Note: this job shows a clear benefit from using Tensor-cores, so that data is included in the results.]

Billion Words Benchmark LSTM Train

This job also uses the docker image mentioned above. The application is big_lstm using the billion word news-feed corpus. See TensorFlow Scaling on 8 1080Ti GPUs - Billion Words Benchmark with LSTM on a Docker Workstation Configuration for example usage.

VGG model in Keras

For a "real-world" test I did an implementation of the VGG CNN using Keras (with the TensorFlow back-end). This was run in a Jupyter notebook with Anaconda Python installed on the machine. The data used is 25000 images of "dogs vs cats" downloaded from Kaggle. The source is included in appendix B.

Peer-to-peer bandwidth and latency

The p2pBandwidthLatencyTest from the NVIDIA CUDA samples was used to show the direct bandwidth-halving effect of moving from X16 to X8. The output is in Appendix A.


Results

The following tables and charts give a pretty clear indication that X16 has only a small advantage over X8. In some cases there is no apparent difference. This does not mean that this will always be the case! These jobs achieve most of their parallelism by distributing batches across the GPUs, so there is very little card-to-card communication. That means most of the effect of the lower bandwidth at X8 will occur during the transfer of data from CPU space to GPU space. How much that affects your particular job will vary. However, these results suggest that the effect may be modest in most cases.
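Most of that host-to-GPU traffic happens when batches of training data are fed to the cards. A minimal TensorFlow 1.x sketch along the following lines (included only for illustration, assuming one visible GPU; the feed_dict path adds Python overhead, so it understates raw PCIe bandwidth) gives a rough feel for that transfer cost:

# Rough timing of the host -> GPU copy incurred when a batch is fed in.
# Assumes TensorFlow 1.x and at least one visible GPU; the feed_dict path
# adds Python/session overhead, so this understates raw PCIe bandwidth.
import time
import numpy as np
import tensorflow as tf

batch = np.random.rand(256, 224, 224, 3).astype(np.float32)   # ~154 MB batch

x = tf.placeholder(tf.float32, shape=batch.shape)
with tf.device('/gpu:0'):
    y = tf.reduce_sum(x)   # forces the fed batch onto the GPU

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(y, feed_dict={x: batch})          # warm-up
    t0 = time.time()
    for _ in range(20):
        sess.run(y, feed_dict={x: batch})
    per_run = (time.time() - t0) / 20
    print('%.2f GB/s effective host->GPU' % (batch.nbytes / per_run / 1e9))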

PCIe X16 vs X8 -- GoogLeNet Training, TensorFlow FP32 and FP16 (Tensor-cores)
Titan V GPUs
[Images/second (total batch size)]

Number of GPUs    PCIe X16 FP32    PCIe X16 FP16    PCIe X8 FP32    PCIe X8 FP16
1                 851.3 (256)      1370.6 (512)     838             1319
2                 1525.1 (512)     2517.0 (1024)    1519            2424
3                 2272.3 (768)     3661.3 (1536)    2153            3572
4                 3080.2 (1024)    4969.6 (2048)    2943            4707


[Chart: GoogLeNet training, images/second vs number of GPUs, PCIe X16 vs X8, FP32 and FP16]

The code for this job run is highly optimized for GPU and there is only a minor difference between X16 and X8. The difference does increase with more GPUs. This chart also shows the nice improvement from using Tensor-cores (FP16) on the Titan V!
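To put a number on "minor", here is a quick check of the FP32 columns from the table above (X8 throughput as a fraction of X16):

# X8 throughput as a fraction of X16 for the FP32 GoogLeNet results above.
x16 = {1: 851.3, 2: 1525.1, 3: 2272.3, 4: 3080.2}   # images/sec at X16, FP32
x8  = {1: 838.0, 2: 1519.0, 3: 2153.0, 4: 2943.0}   # images/sec at X8, FP32
for n in sorted(x16):
    print('%d GPU(s): X8 runs at %.1f%% of X16' % (n, 100.0 * x8[n] / x16[n]))
# roughly 98-100% at 1-2 GPUs, about 95% at 3-4 GPUs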


PCIe X16 vs X8 -- Billion Words Benchmark LSTM Train, TensorFlow
Titan V GPUs
[Words per second]

Number of GPUs    PCIe X16    PCIe X8
1                 8373        8176
2                 15483       14686
3                 20462       19178
4                 22058       20565


[Chart: Billion Words LSTM training, words per second vs number of GPUs, PCIe X16 vs X8]

Again, the difference between X16 and X8 is minor and increases with more GPUs.


PCIe X16 vs X8 -- VGG, Keras (TensorFlow) Memory-streaming 25000 Images
Titan V GPUs
[Training time for 4 epochs (seconds)]

Number of GPUs    PCIe X16    PCIe X8
1                 476         476
2                 262         269
3                 222         202
4                 188         195


[Chart: VGG in Keras (memory-streaming), training time for 4 epochs, PCIe X16 vs X8]

In this Keras implementation of VGG there is even less performance difference between X16 and X8. The multi-GPU scaling beyond 2 GPUs is also not as good as in the previous jobs. The image data was loaded into memory and fed to the model through Python variables. The Python source that was used for this job is given in Appendix B.
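The multi-GPU rows in this table follow the commented-out multi_gpu_model lines in Appendix B. A minimal sketch of the 4 GPU variant (gpus=4 assumed here, reusing model, X, Y and tensorbrd as defined in that notebook):

# Multi-GPU variant of the memory-streaming run (per the commented-out lines
# in Appendix B); gpus=4 is shown as an example.
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='binary_crossentropy', optimizer='adam',
                       metrics=['accuracy'])
parallel_model.fit(X, Y, batch_size=64, epochs=4, verbose=1,
                   callbacks=[tensorbrd])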


PCIe X16 vs X8 -- VGG in Keras (TensorFlow) Disk-streaming 25000 Images
Titan V GPUs
[Training time for 4 epochs (seconds)]

Number of GPUs    PCIe X16    PCIe X8
1                 445         445
2                 320         314
3                 316         318
4                 316         316


[Chart: VGG in Keras (disk-streaming), training time for 4 epochs, PCIe X16 vs X8]

For this job run the image data was fed to the model in batches loaded from disk and put into the proper format on-the-fly. That caused multi-GPU scaling to fall off completely after 2 GPUs. And, again, dropping to X8 had almost no effect on the performance. The real bottleneck here is the time taken to convert the image data pulled from disk. [Note: the data was stored on a fast NVMe SSD.]
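For reference, a sketch of that disk-streaming variant, following the commented-out fit_generator lines in Appendix B (and reusing train_datagen, train_dir, img_size, batch_size, model and tensorbrd from that notebook):

# Disk-streaming variant: batches of 64 images are read from disk and decoded
# on the fly instead of pre-loading all 25,000 images into memory.
train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=img_size,
        batch_size=batch_size,
        class_mode='binary')

model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd],
                    steps_per_epoch=25000 // batch_size)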


There you have it, PCIe X16 vs PCIe X8. Not a lot of difference from what I've seen in the testing above. Still, I think I would always prefer to have X16 for my GPUs!

What happens if you drop even further? I don't really know, but I suspect X4 may still be OK. If you are thinking about turning your X1 "coin mining rig" into a machine learning box, I think it would probably work OK. If you try that, let me know!

Happy computing! --dbk


Appendix A: Peer to peer bandwidth and latency test results

For completeness, I wanted to include the results from running p2pBandwidthLatencyTest (source available from the NVIDIA CUDA samples).

The bandwidth and latency for this test system look very good. You do see some expected bandwidth lowering and latency increase across devices 2 and 3, which are on the PLX switch.

PCIe X16:
./p2pBandwidthLatencyTest

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one,
 it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.75   5.74   5.76
     1   5.86 554.87   5.72   5.74
     2   5.87   5.87 554.87   5.77
     3   5.75   5.76   5.81 555.65
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 554.87   6.03   6.02   6.02
     1   6.05 554.87   6.01   6.03
     2   4.39   4.39 554.08   4.26
     3   4.40   4.40   4.27 553.29
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 566.53  11.05  11.03  10.99
     1  11.07 564.49  11.15  10.92
     2  11.15  11.16 562.86   6.18
     3  11.05  11.01   6.19 564.49
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 564.05  10.83   8.40   8.43
     1  10.84 563.67   8.40   8.43
     2   8.40   8.39 564.90   8.07
     3   8.43   8.42   8.11 563.67
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.94  16.23  16.50  16.51
     1  16.41   3.03  16.53  16.52
     2  17.40  17.54   3.74  18.90
     3  17.42  17.36  18.73   3.08
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   3.01   5.38   5.55   5.57
     1   5.26   3.00   5.79   5.80
     2   6.81   6.91   3.00   6.64
     3   6.81   6.89   6.62   3.01

PCIe X8:

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, TITAN V, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, TITAN V, pciBusID: 65, pciDeviceID: 0, pciDomainID:0
Device: 2, TITAN V, pciBusID: b5, pciDeviceID: 0, pciDomainID:0
Device: 3, TITAN V, pciBusID: b6, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one,
 it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0	     1     1     1     1
     1	     1     1     1     1
     2	     1     1     1     1
     3	     1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 556.45   3.07   3.04   3.04
     1   3.04 552.51   3.04   3.04
     2   3.15   3.15 554.87   3.16
     3   3.15   3.15   3.15 553.29
Unidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 555.65   5.18   5.17   5.17
     1   5.18 554.87   5.18   5.17
     2   4.22   4.23 555.65   4.03
     3   4.26   4.26   4.07 554.08
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 563.67   5.99   5.87   5.88
     1   6.02 562.46   5.89   5.89
     2   5.91   5.86 561.24   5.98
     3   5.87   5.84   5.98 562.05
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 565.30   8.21   7.27   7.31
     1   8.21 563.27   7.26   7.30
     2   7.26   7.26 562.86   6.21
     3   7.30   7.31   6.23 562.86
P2P=Disabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.97  15.87  16.50  16.44
     1  15.86   2.99  16.42  16.42
     2  17.40  17.18   2.99  18.50
     3  17.31  17.32  18.47   2.96
P2P=Enabled Latency Matrix (us)
   D\D     0      1      2      3
     0   2.98   5.42   5.90   5.91
     1   5.46   3.02   5.57   5.62
     2   7.03   6.76   3.06   6.66
     3   6.82   6.73   6.65   3.04

Appendix B: VGG model training on the dogs vs cats image set (25,000 images)

import numpy as np
np.random.seed(42)
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from keras.layers.normalization import BatchNormalization
from keras.callbacks import TensorBoard

from keras.preprocessing.image import ImageDataGenerator

from keras.utils import multi_gpu_model
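
# VGG-style stack of 3x3 convolutions with max-pooling and batch normalization,
# ending in a single sigmoid unit for the binary dogs-vs-cats label.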
model = Sequential()

model.add(Conv2D(64, 3, activation='relu', input_shape=(224,224,3)))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(128, 3, activation='relu'))
model.add(Conv2D(128, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2,2))
model.add(BatchNormalization())

model.add(Flatten())
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid'))

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 222, 222, 64)      1792      
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 220, 220, 64)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 110, 110, 64)      0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 110, 110, 64)      256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 108, 108, 128)     73856     
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 106, 106, 128)     147584    
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 53, 53, 128)       0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 53, 53, 128)       512       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 51, 51, 256)       295168    
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 49, 49, 256)       590080    
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 47, 47, 256)       590080    
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 23, 23, 256)       0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 23, 23, 256)       1024      
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 21, 21, 512)       1180160   
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 19, 19, 512)       2359808   
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 17, 17, 512)       2359808   
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 8, 8, 512)         0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 8, 8, 512)         2048      
_________________________________________________________________
conv2d_11 (Conv2D)           (None, 6, 6, 512)         2359808   
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 4, 4, 512)         2359808   
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 2, 2, 512)         2359808   
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 1, 1, 512)         0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 1, 1, 512)         2048      
_________________________________________________________________
flatten_1 (Flatten)          (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 4096)              2101248   
_________________________________________________________________
dropout_1 (Dropout)          (None, 4096)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 4096)              16781312  
_________________________________________________________________
dropout_2 (Dropout)          (None, 4096)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 4097      
=================================================================
Total params: 33,607,233
Trainable params: 33,604,289
Non-trainable params: 2,944
_________________________________________________________________
#parallel_model = multi_gpu_model(model, gpus=2)
#parallel_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Set up data

tensorbrd = TensorBoard('./logs/vgg-1')
batch_size = 64
train_dir= './train'
img_size = (224,224)

train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=img_size,
        #batch_size=batch_size,
        # the following batch size loads all 25000 images at once
        batch_size=25000,
        class_mode='binary')
Found 25000 images belonging to 2 classes.
for X, Y in train_generator:
    print('data X shape:', X.shape)
    print('labels Y shape:', Y.shape)
    break
data X shape: (25000, 224, 224, 3)
labels Y shape: (25000,)
#parallel_model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd],
#                             steps_per_epoch=25000 // batch_size,
#                             max_queue_size=25000, workers=16)
#model.fit_generator(train_generator, epochs=4, verbose=1, callbacks=[tensorbrd])
#parallel_model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])
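# Memory-streaming run: train directly on the in-memory arrays X, Y;
# the commented parallel_model lines above are the multi-GPU variant.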
model.fit(X,Y, batch_size=batch_size, epochs=4, verbose=1, callbacks=[tensorbrd])
Tags: PCIe, Titan V, Machine Learning, TensorFlow, Keras
lemans24

Excellent tests regarding 16 PCIe vs 8 PCIe lanes... thanks for the info, but I think these tests fall short because 99.99% of apps currently do NOT transfer more than 8 GB in a SINGLE PCIe data transfer. The theoretical limit for 8 PCIe lanes is approx 8 GB/s, so unless the app makes memory allocations greater than 8 GB you will never hit the real limit between 16 and 8 PCIe lanes.

I have created a CUDA-based Monte Carlo simulation where I allocate all 12 GB on a Titan Xp (or 10 GB on a 1080 Ti), and when I do transfers greater than 8 GB of data I do see significant time spent transferring data between the GPU card and the host. Since I know exactly how much memory I use when transferring data at any single time, the difference between 16 and 8 PCIe lanes will definitely show up for me.

I am struggling to define the requirements of my GPU server, as 3 Titan Xp cards on a real X16 lane motherboard (no PLX switches) may be just as fast as 4 Titan Xp cards using X8 lanes. Supermicro has 2 excellent server motherboards that I am considering: the C9X299-RPGF with 4 x X8 slots and the H11SSL-NC with 3 x X16 slots.

My gut feeling is that 4 GPU cards would be faster at processing large batches, even when transferring over 8 GB of data, vs 3 GPU cards in the other motherboard, as the X299 board allows me to run a much faster CPU and I still do all my summing and averaging on the CPU.

I am stressing nearly every component in my PC for as much real-time performance as possible...

Posted on 2018-05-22 14:50:10
Donald Kinghorn

Thanks for posting your comment! I hear you. That 99.99% comment is my feeling too but, yes, when you need the bandwidth you really want to have it.

I was expecting that by using the maximum batch sizes I could fit on the GPUs I would see an effect from the different bandwidths, but I really didn't see it. I increased batch size until I got memory exhausted errors from the GPUs. I just realized I did not do that with the Keras VGG job runs. On the GoogLeNet training the batch size is in parentheses and those are as large as I could go for the given number of GPUs and precision. In that code the parallelism is from dividing up the given batch size across the GPUs. In Keras the given batch size is used on 'each' GPU.

I really expected to see more performance hit on X8!

I don't doubt that your Monte Carlo simulations will stress the bandwidth. I think going with a 3 GPU setup on all X16 (no PLX) is a good idea. I would expect you to get very good utilization of the hardware. The X299 boards are in general pretty good and the Skylake-X i9 CPUs are nice!

Posted on 2018-05-24 00:59:28
lemans24

Just saw this: an AMD EPYC motherboard with 4 x PCIe X16 slots
https://www.anandtech.com/s...

Hopefully you can find out more info and do a review, as this is definitely THE type of motherboard to get to properly run a 4 GPU deep learning / HPC system.

Posted on 2018-06-21 01:48:21
Donald Kinghorn

Yes, I would really like to do some testing with EPYC and that looks like an appropriate board. We have been waiting for ages for AMD to get stuff to us! ... and there haven't been a lot of interesting boards either. We have given board makers feedback from the start that single socket + 4 x X16 boards could be a hit with the HPC and ML crowd, but they have been very slow to respond. I'll bug a couple of folks here and see if I can get an update on status.

I should be doing some testing on a dual Xeon-SP system (workstation class) but again we've had trouble with the boards. Both "Purley" and EPYC have had the slowest roll-out I've ever seen for CPU platforms. The stuff is out there but supply and motherboards are a mess.

Posted on 2018-06-21 14:54:40
Mark Johnstone

Great test; it is nice to see some actual data on this topic. I'm also interested to see how much of an effect, if any, system RAM bandwidth has on deep-learning applications. Have you considered running a test or two comparing DDR4-2666 vs. something faster like DDR4-3200 or above? I suspect the difference would be small given that you are probably limited by PCIe bandwidth, but with 4 GPUs using up 32 lanes, it might not be zero.

I enjoy reading your blogs, and was happy I got to meet you at GTC.

Posted on 2018-06-25 19:58:05
Donald Kinghorn

Thanks Mark!
Data movement in CPU-GPU space is really important. It's hard to test because of all the factors that can have an effect. I really expected to see more difference with this X8 vs X16 testing! Memory clocks could certainly have a performance impact. I don't know if I'll get to test that or not ... it would be interesting. I could see doing something simple like running tests with memory at 2666 MHz and then clocking that down in the BIOS to see when it becomes a limiting factor ...

I hope to be testing soon with a dual Xeon setup and 4 GPU's. That should expose some interesting mem transfer patterns too because of the split across the CPU's ...

Posted on 2018-06-26 18:16:39
Jennifer Amb

Just found this... interesting!
Hope you still read this.
Did you monitor the system under test for other bottlenecks than GPU?
GPU bandwidth will matter little if there is any other bottleneck in the system. CPU, RAM bandwidth and storage are obvious candidates to look out for.
For example, it would be interesting to see a CPU graph for the test run.

Posted on 2018-07-12 17:45:02
Donald Kinghorn

These results were surprising. The main thing I did to try to keep other factors from coming into play was this,
train_generator = train_datagen.flow_from_directory(
train_dir,
target_size=img_size,
#batch_size=batch_size,
# the following batch size loads all 25000 images at once
batch_size=25000,
class_mode='binary')
That loads everything into memory up front, and that was much faster with way better multi-GPU scaling than feeding from the generator batch-by-batch.

I had limited time for the testing and doing the sticky note thing is kind of a pain (I had to do it twice because I had messed something up on my first run :-)

This kind of stuff is really interesting to do!
Best wishes --Don

Posted on 2018-07-13 22:26:44