
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1093
Dr Donald Kinghorn (Scientific Computing Advisor)

Intel CPU flaw kernel patch effects - GPU compute Tensorflow Caffe and LMDB database creation

Written on January 10, 2018 by Dr Donald Kinghorn

The Intel CPU flaw and the Meltdown and Spectre exploits have caused a lot of concern about how the "fixes" will impact software performance. Estimates of the slowdown range from 5-50% for worst case scenarios. There will no doubt be lots of testing going on over the next few weeks. I decided to check some Machine Learning frameworks with a couple of large dataset problems that I had used to evaluate the performance of the new NVIDIA Titan V GPU. I was concerned because these problems need to move data from storage to CPU memory space and then to GPU memory. Also, to stress the CPU and system I/O, I did an LMDB database generation on 1.3 million images from ImageNet. These are potentially the kinds of problems that could be bad cases for the kernel fixes. I ran these jobs on Ubuntu 16.04 with 2 patched and 2 un-patched kernel versions to see if there was any slowdown.

TLDR: No slowdown on the Tensorflow and Caffe training jobs or the database generation. But, future testing will be needed when all security patches are finished.

The recent revelations about the flaw in Intel CPUs and the story of how it was discovered and exploited are very interesting. It would make a great plot for a movie!

OS kernel developers have been busy making modifications to mitigate the severity of the CPU hardware flaw. Parts of the exploits take advantage of engineering "tricks" that do speculative code execution and branching to speed up performance. That means those nice tricks to speed things up have to be at least partly circumvented by the fixes, i.e. the fixes can lead to a software execution slowdown. There are some worst cases with software that needs to make lots and lots of OS system calls ... but what about the "normal" stuff that you do? The only way to know for sure is to run jobs and compare results between patched and un-patched kernels.
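To get a rough feel for the system call worst case, a tiny micro-benchmark can be run under each kernel. The following is only a minimal sketch and not part of the tests in this post: the function name `time_syscalls`, the choice of `os.stat()` as the system call, and the call count are all assumptions made for illustration.

```python
import os
import platform
import time


def time_syscalls(n_calls=200_000, path="."):
    """Return the average wall-clock seconds per os.stat() call.

    Hypothetical helper: os.stat() is used only because every call has to
    cross from user space into the kernel, which is the transition that the
    kernel patches make more expensive.
    """
    start = time.perf_counter()
    for _ in range(n_calls):
        os.stat(path)
    elapsed = time.perf_counter() - start
    return elapsed / n_calls


if __name__ == "__main__":
    per_call = time_syscalls()
    print(f"kernel {platform.release()}: {per_call * 1e6:.2f} microseconds per stat() call")
```

Running the same script after booting into a patched and an un-patched kernel gives two per-call numbers to compare. A loop like this does almost nothing except enter and leave the kernel, so it is close to a worst case and says little about jobs that spend most of their time computing.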

You will be seeing many software updates over the next few weeks to work around any exposed exploits and performance losses. System call "latency hiding" is a best practice for software development, so I expect that most performance issues that come up will be at least mostly taken care of going forward.


Test Jobs

You can reference my recent post NVIDIA Titan V vs Titan Xp Preliminary Machine Learning and Simulation Tests for information about how I ran the jobs.

The two jobs from the Titan V testing article that I'm rerunning for this post are:

  • Tensorflow LSTM (Train) on 1 Billion Word Benchmark Dataset

  • DIGITS 6.0 with Caffe, GoogLeNet Model Training on 1.3 Million Image Dataset

Additionally, I did a reconstruction of the database from the images I used for training GoogLeNet with Caffe.

  • LMDB database creation from 1.3 million ImageNet images

The hardware I used in this post is different from the earlier Titan V testing.

Test set-up

Kernel versions

  • 4.4.0.87 (default from Ubuntu server 16.04.3)

  • 4.4.0.104 (normal update from above)

  • 4.4.0.108.131 (patched kernel from 4.4.0 branch)

  • 4.13.0-25.29 (patched HWE kernel)

Note: Ubuntu is maintaining a wiki page with information about the patches for Meltdown and Spectre.
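When switching between several kernels it helps to record which one a job actually ran on. The snippet below is a hedged sketch rather than what was used for this post: `kernel_report` is a hypothetical helper, and the sysfs directory it reads is only present on kernels that expose mitigation status there, so it may well be missing on the kernels listed above. In that case the mitigation list is simply empty.

```python
import platform
from pathlib import Path

# Assumption: kernels that report mitigation status do so in this directory.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def kernel_report(vuln_dir=VULN_DIR):
    """Return the running kernel release and any reported mitigation status."""
    report = {"kernel": platform.release(), "mitigations": {}}
    if vuln_dir.is_dir():
        for entry in sorted(vuln_dir.iterdir()):
            try:
                report["mitigations"][entry.name] = entry.read_text().strip()
            except OSError:
                report["mitigations"][entry.name] = "unreadable"
    return report


if __name__ == "__main__":
    info = kernel_report()
    print("kernel:", info["kernel"])
    for name, status in info["mitigations"].items():
        print(f"  {name}: {status}")
```

Saving this output next to each timing result makes it harder to mix up runs from different kernels.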


Results

I'm just going to provide a table of the results, since I did this as a "quick" test to check for performance problems. It doesn't warrant a bar chart :-)

I was most interested in running tests with kernel versions that were as close as possible to each other. It's not unusual for job performance to differ between kernels that don't have the same major and minor version numbers. To this end, the most meaningful comparisons are probably between kernel 4.4.0.104 (un-patched) and 4.4.0.108 (patched for Meltdown).

Kernel version | Tensorflow LSTM (Train)  | Caffe GoogLeNet CNN (Train) 1 epoch | LMDB database creation
               | 1 billion word dataset   | ImageNet 1.3 million images         | from 1.3 million images
---------------+--------------------------+-------------------------------------+------------------------
4.4.0.87       | 7903 words per second    | 36min 48sec                         | not tested
4.4.0.104      | 7903 words per second    | 36min 45sec                         | 56min 35sec
4.4.0.108      | 7912 words per second    | 37min 31sec                         | 57min 5sec
4.13.0-25.29   | 8151 words per second    | 35min 48sec                         | not tested

In my opinion these results are essentially the same. There is usually some variation in timing between job runs even when everything is run "exactly" the same way. The patched and un-patched 4.4.0 kernels look basically the same, with just a small slowdown on the Caffe training and LMDB jobs with the patched 4.4.0.108 kernel. It looks like the patched 4.13.0 kernel gives slightly better performance than the older 4.4.0 kernels.
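To put numbers on "essentially the same", the differences between the un-patched 4.4.0.104 and patched 4.4.0.108 runs can be worked out as percentages. This is just a small arithmetic sketch using the figures from the results above; the helper names `to_seconds` and `percent_change` are made up for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xmin Ysec' timing into seconds."""
    return 60 * minutes + seconds


def percent_change(baseline, value):
    """Percent change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline


# (un-patched 4.4.0.104, patched 4.4.0.108), taken from the results above
results = {
    "Tensorflow LSTM (words/sec, higher is better)": (7903, 7912),
    "Caffe GoogLeNet 1 epoch (sec, lower is better)": (to_seconds(36, 45), to_seconds(37, 31)),
    "LMDB creation (sec, lower is better)": (to_seconds(56, 35), to_seconds(57, 5)),
}

for name, (unpatched, patched) in results.items():
    print(f"{name}: {percent_change(unpatched, patched):+.2f}%")
```

That works out to about +0.11% on the Tensorflow throughput, about +2.09% on the Caffe training time, and about +0.88% on the LMDB creation time. These are single runs, so differences of this size may be within normal run-to-run variation.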

A couple of bugs: Both of the patched kernels, 4.4.0.108 and 4.13.0-25.29, hung during shutdown or restart and required hitting the power button to bring them down. There also seemed to be a possible memory allocation problem when I started up the Caffe DNN jobs. It seemed to be a misread of the memory capacity on the Titan V, and the job started with a memory allocation greater than the card memory. It took several retries to get jobs started. I'm not sure what the problem really was. I expect these kinds of problems to go away soon.

Bottom line:

There will no doubt be some programs and job runs that will be affected by the patches for Meltdown and Spectre. But, I expect slowdowns to be uncommon. There will be more kernel and driver patches and, no doubt, many software package and firmware fixes over the next several weeks. Ongoing testing will be needed after all of the security patches are finished, but I don't expect many dramatic application slowdowns. One good thing from all of this is that it will force developers to refactor their crusty old code to make performance improvements. That's a good thing!

Happy computing --dbk

Tags: Security, Linux, Ubuntu, Intel, Meltdown, Spectre, NVIDIA
Niko Nikolov

I just finished building a nice 7980XE workstation. In order not to lose "some performance", can't I just opt not to install the Microsoft patch, BIOS patch and microcode stuff??

Just use an updated antivirus and browser and continue as if nothing happened?

Posted on 2018-01-12 18:58:06
Donald Kinghorn

... well, it's more complicated than that .. and in any case, the patches are going to be pushed automatically since they are security related. I'm not sure why a BIOS patch would be needed?? I haven't looked into that yet. I'm afraid the patches will be unavoidable in a practical sense.

I honestly wouldn't worry much about this. I don't think you will see any performance degradation. It may take a few rounds of patches before everything settles down but in the end it will be OK. Any software that does take a performance hit will probably get fixed. That was what I was talking about in the last line of the post.

Congratulations on that new 7980XE setup! That is an incredibly great processor.
Best wishes --Don

Posted on 2018-01-13 22:31:42