Intel CPU flaw kernel patch effects - GPU compute Tensorflow Caffe and LMDB database creation
Written on January 10, 2018 by Dr Donald Kinghorn
The Intel CPU flaw and the Meltdown and Spectre exploits have caused a lot of concern about how the "fixes" will impact software performance. There are estimates of 5-50% slowdown for worst case scenarios. There will no doubt be lots of testing going on over the next few weeks. I decided to check some Machine Learning frameworks with a couple of large dataset problems that I had used to evaluate the performance of the new NVIDIA Titan V GPU. I was concerned because these problems need to move data from storage to CPU memory space and then to GPU memory. Also, to stress the CPU and system I/O, I did an LMDB database generation on 1.3 million images from ImageNet. These are potentially the kinds of problems that could be bad cases for the kernel fixes. I ran these jobs on Ubuntu 16.04 with 2 patched and 2 un-patched kernel versions to see if there was any slowdown.
TLDR: No slowdown on the Tensorflow and Caffe training jobs or the database generation. But, future testing will be needed when all security patches are finished.
The recent revelations about the flaw in Intel CPUs and the story of how it was discovered and exploited are very interesting. It would make a great plot for a movie!
OS kernel developers have been busy making modifications to mitigate the severity of the CPU hardware flaw. Parts of the exploits take advantage of engineering "tricks" that do speculative code execution and branching to speed up performance. That means those nice tricks have to be at least partly circumvented by the fixes, i.e. the fixes can lead to a software execution slowdown. There are some worst cases with software that needs to make lots and lots of OS system calls ... but what about the "normal" stuff that you do? The only way to know for sure is to run jobs and compare results between patched and un-patched kernels.
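To get a rough feel for system call cost on your own machine, you can time a loop that enters the kernel on every iteration against a loop that stays in user space. What follows is only a minimal sketch, assuming Python 3; it uses `os.getppid` as a stand-in for a cheap system call, and the helper names `time_loop` and `syscall_overhead` are my own illustration, not part of the tests in this post.

```python
import os
import time


def time_loop(fn, iterations):
    """Return the wall-clock seconds needed to call fn() `iterations` times."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return time.perf_counter() - start


def noop():
    """User-space baseline: a Python call that never enters the kernel."""
    return None


def syscall_overhead(iterations=200_000):
    """Compare per-call cost of a cheap system call with a user-space no-op.

    os.getppid is assumed to enter the kernel on every call; the numbers
    include Python interpreter overhead, so only the difference is meaningful.
    """
    syscall_s = time_loop(os.getppid, iterations)
    baseline_s = time_loop(noop, iterations)
    return {
        "syscall_us": syscall_s / iterations * 1e6,
        "baseline_us": baseline_s / iterations * 1e6,
    }


result = syscall_overhead()
print("per-call cost with syscall: %.3f us" % result["syscall_us"])
print("per-call cost, user space:  %.3f us" % result["baseline_us"])
```

Running the same script after booting a patched and an un-patched kernel gives a crude indication of how much extra cost each kernel entry carries. A job that makes few system calls per second of work should see little of that difference.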
You will be seeing many software updates over the next few weeks to work around any exposed exploits and performance losses. System call "latency hiding" is a best practice for software development, so I expect that most performance issues that come up will be at least mostly taken care of going forward.
You can reference my recent post NVIDIA Titan V vs Titan Xp Preliminary Machine Learning and Simulation Tests for information about how I ran the jobs.
The two jobs from the Titan V testing article that I'm rerunning for this post are:
Tensorflow LSTM (Train) on 1 Billion Word Benchmark Dataset
DIGITS 6.0 with Caffe, GoogLeNet Model Training on 1.3 Million Image Dataset
Additionally, I did a reconstruction of the database from the images I used for training GoogLeNet with Caffe.
LMDB database creation from 1.3 million ImageNet images
The hardware I used in this post is different from the earlier Titan V testing.
Intel Skylake-X 7900X 10-core CPU.
Gigabyte X299 Aorus motherboard
128GB DDR4 2666MHz memory
NVIDIA Titan V
NVIDIA GPU Cloud (NGC) docker registry
220.127.116.11 (default from Ubuntu server 16.04.3)
18.104.22.168 (normal update from above)
22.214.171.124.131 (patched kernel from 4.4.0 branch)
4.13.0-25.29 (patched HWE kernel)
Note: Ubuntu is maintaining a wiki page with information about the patches for Meltdown and Spectre.
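If you want to confirm which kernel you are actually booted into and whether it reports any mitigations, a short script can help. This is a minimal sketch under an assumption: newer Linux kernels expose per-vulnerability status files under `/sys/devices/system/cpu/vulnerabilities`, but that directory may not exist on the kernels listed above, in which case the function simply returns an empty result. The function name `mitigation_status` is hypothetical.

```python
import platform
from pathlib import Path

# Assumption: this sysfs directory exists only on kernels that report
# mitigation status; older kernels will not have it.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability name: status text} read from sysfs.

    Returns an empty dict when the directory is missing.
    """
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "unreadable"
    return status


print("Running kernel:", platform.release())
report = mitigation_status()
if not report:
    print("No vulnerability status files found on this system.")
for name, state in report.items():
    print("%s: %s" % (name, state))
```

Recording this output alongside each timing run makes it easier to be sure which kernel a result belongs to.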
I'm just going to provide a table with the results that I have done as a "quick" test to check for performance problems. It doesn't warrant a bar chart :-)
I was most interested in running tests with kernel versions that were as close as possible to each other. It's not unusual for results to differ between jobs run on kernels that don't have the same major and minor version numbers. To this end the most meaningful comparisons are probably between kernel 126.96.36.199 (un-patched) and 188.8.131.52 (patched for Meltdown).
| Kernel version | Tensorflow LSTM (Train), 1 billion word dataset | Caffe GoogLeNet CNN (Train) 1 epoch, ImageNet 1.3 million images | LMDB database creation from 1.3 million images |
|---|---|---|---|
| 184.108.40.206 | 7903 words per second | 36min 48sec | not tested |
| 220.127.116.11 | 7903 words per second | 36min 45sec | 56min 35sec |
| 18.104.22.168 | 7912 words per second | 37min 31sec | 57min 5sec |
| 4.13.0-25.29 | 8151 words per second | 35min 48sec | not tested |
In my opinion these results are essentially the same. There is usually some variation in timing between job runs even when everything is run "exactly" the same way. The patched and un-patched 4.4.0 kernels look basically the same, with just a small slowdown on the patched 22.214.171.124 kernel. It looks like the patched 4.13.0 kernel gives slightly better performance than the older 4.4.0 kernels.
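To put numbers on "essentially the same", the relative differences can be worked out directly from the table. The sketch below assumes the second and third table rows are the un-patched and patched 4.4.0 kernels respectively (that matches the order of the kernel list, but it is my reading of the table); the helper names are illustrative only.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XXmin YYsec' table entry to seconds."""
    return minutes * 60 + seconds


def pct_change(before, after):
    """Percent change going from `before` to `after`."""
    return (after - before) / before * 100.0


# Values copied from the table; the row-to-kernel mapping is an assumption.
unpatched = {
    "lstm_words_per_sec": 7903,
    "caffe_sec": to_seconds(36, 45),
    "lmdb_sec": to_seconds(56, 35),
}
patched = {
    "lstm_words_per_sec": 7912,
    "caffe_sec": to_seconds(37, 31),
    "lmdb_sec": to_seconds(57, 5),
}

for key in unpatched:
    change = pct_change(unpatched[key], patched[key])
    print("%-20s %+.2f%%" % (key, change))
```

With these inputs the LSTM throughput changes by about +0.1%, the Caffe epoch takes about 2.1% longer, and the LMDB creation takes about 0.9% longer, all of which is within the range I would expect from ordinary run-to-run variation.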
A couple of bugs: Both of the patched kernels, 126.96.36.199 and 4.13.0-25.29, hung during shutdown or restart and required hitting the power button to bring them down. There also seemed to be a possible memory allocation problem when I started up the Caffe DNN jobs. It looked like a memory capacity misread on the Titan V, with the job starting with a memory allocation greater than the card memory. It took several retries to get jobs started. I'm not sure what the problem really was. I expect these kinds of problems to go away soon.
There will no doubt be some programs and job runs that will be affected by the patches for Meltdown and Spectre. But, I expect slowdowns to be uncommon. There may be more testing needed after all of the security patches are finished. There will be more kernel and driver patches and, no doubt, many software package and firmware fixes over the next several weeks. Ongoing testing will be needed but I don't expect many dramatic application slowdowns. One upside of all of this is that it will force developers to refactor their crusty old code to make performance improvements. That's a good thing!
Happy computing --dbk