Intel Skylake 6700K with Parallel Studio XE 2016 vs 2015 on Fedora 23: Much Better!

I wrote a blog post a few weeks ago with some initial testing of the new Intel Skylake-S Core i7-6700K and Core i5-6600K processors. The thing in that testing that bothered me the most was the poor performance on the Linpack benchmark. I had expected performance at least as good as the Haswell processors, but Skylake was much worse … it just didn’t make sense!

You need to understand that the Linpack benchmark is usually highly optimized for a given hardware platform. Intel puts significant effort into ensuring that Linpack will show off the compute capability of the processor, and you can expect to get GFLOP/s numbers that are at least 85% of the theoretical peak performance for the architecture. Since Skylake still uses AVX2 + FMA3, same as Haswell, I expected roughly the same performance, or maybe a bit better because of improvements in memory handling, instruction/data scheduling, and thread dispatch, plus whatever other secret sauce they may have added.

Expected result was roughly,

Num cores * Core clock (base clock) * 8 (for AVX2) * 2 (for FMA3)

[ … things like “all core turbo” improve this but it’s a little unpredictable under heavy load ]

For Core-i7 6700K theoretical peak would be around,

4  * 4 * 8 * 2 = 256 GFLOP/s

85% of that would be around 218 GFLOP/s [ and we really expect more than 85%! ]
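
Just to make that arithmetic explicit, here is a throwaway C snippet (mine, purely illustrative) that plugs in those numbers:

/* Back-of-the-envelope peak estimate for the i7-6700K, using the factors
   from the formula above (cores * base clock * 8 for AVX2 * 2 for FMA3). */
#include <stdio.h>

int main(void)
{
    double cores     = 4.0;   /* physical cores             */
    double clock_ghz = 4.0;   /* base clock in GHz          */
    double avx2      = 8.0;   /* the "8 (for AVX2)" factor  */
    double fma3      = 2.0;   /* the "2 (for FMA3)" factor  */

    double peak = cores * clock_ghz * avx2 * fma3;   /* GFLOP/s */

    printf("Theoretical peak : %.1f GFLOP/s\n", peak);         /* 256.0 */
    printf("85%% of peak     : %.1f GFLOP/s\n", 0.85 * peak);  /* 217.6 */
    return 0;
}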

The Haswell Core i7-4790K @ 4GHz, which has the same theoretical peak, gives a very respectable 234 GFLOP/s. The Skylake 6700K @ 4GHz only gave 200 GFLOP/s, and we really expected to see something around 10% higher than the 4790K based on other testing we had done. That 200 GFLOP/s result for Skylake was disturbing.

Fedora 23

I decided to retest, and for the install I went bleeding edge on the Linux OS so that the system kernel would be fully aware of the Skylake architecture. I went with the TC4 beta build of the upcoming Fedora 23 Linux distribution.


https://dl.fedoraproject.org/pub/alt/stage/23_Beta_TC4/Live/x86_64/Fedora-Live-MATE_Compiz-x86_64-23_Beta-TC4.iso

[kinghorn@localhost ~]$ uname -r
4.2.0-1.fc23.x86_64

This was using kernel version 4.2.0. However, other than not getting “Unrecognised Hardware” messages, I don’t believe using this Fedora 23 beta had anything to do with the improved performance we saw … but I haven’t verified that!

Intel Parallel Studio XE 2016 to the rescue

The testing that we did a few weeks ago used the Linpack build from MKL included with Intel Parallel Studio XE 2015 update 3. That compiler did not really know about any architectural changes in Skylake. The 2016 version of the compilers, along with an updated MKL, was just recently released, so I decided to recheck Linpack to see if the new dev tools would make a difference … and it made a huge difference!

Intel MKL Linpack Benchmark — Core i7-6700K

Intel Parallel Studio XE 2015 update 3    MKL 11.2    200 GFLOP/s
Intel Parallel Studio XE 2016             MKL 11.3    256 GFLOP/s

Here’s some of the output from the new Linpack test run,

Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Number of CPUs: 1
Number of cores: 4
Number of threads: 8

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size    LDA     Align.  Time(s)     GFlops      Residual        Residual(norm)  Check
30000   30000   1       70.881      253.9736    6.426480e-10    2.533325e-02    pass
35000   35000   1       111.866     255.5348    7.896800e-10    2.292321e-02    pass
40000   40000   1       167.556     254.6599    1.071610e-09    2.383298e-02    pass

That is an amazing improvement! At roughly 254 GFLOP/s the result is essentially 100% of the 256 GFLOP/s peak figure computed at the base clock (turbo presumably provides the headroom), which is even better than what was initially expected. Why?!

Why indeed! That is a ridiculously large improvement. Looking back at the transitions to Ivy Bridge and Haswell, I remember seeing only small improvements from compiler and library updates. I really don’t know why the Linpack run from Parallel Studio XE 2015 was so bad, or why the 2016 version gives a better-than-expected result. I have not seen a good analysis of the architecture changes yet, so I’m puzzled.

I looked at the CPUID flags (the flag names below are as reported in Linux’s /proc/cpuinfo) to see what was different for Skylake compared to Haswell. There is a small sketch for dumping them after the list.

CPUID flags in Skylake but not in Haswell

  • 3dnowprefetch : 3DNow prefetch instructions
  • hwp : Hardware Managed Performance States
  • hwp_notify : HWP notification
  • hwp_act_window : HWP activity window
  • hwp_epp : HWP energy performance preference
  • intel_pt : Intel Processor Tracing
  • hle : Hardware Lock Elision
  • rtm : Restricted Transactional Memory
  • mpx : Memory Protection Extensions
  • rdseed : The RDSEED instruction
  • adx : The ADCX and ADOX instructions
  • smap : Supervisor Mode Access Prevention
  • clflushopt : Optimized CLFlush
  • xsaveopt : Optimized Xsave
  • xsavec : Save Processor Extended States with Compaction
  • xgetbv1 : The XGETBV instruction with ECX=1
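
If you want to do the same comparison, a minimal C sketch like the following (my own throwaway code, nothing official) will print each flag from /proc/cpuinfo on its own line so that the output from the Skylake and Haswell machines can be sorted and diffed:

/* Print each CPU flag from /proc/cpuinfo on its own line.  Run it on two
   machines, sort the output, and diff the two files to see which flags are
   present on only one of them.  Linux-only; uses the first "flags" line.  */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[8192];
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("/proc/cpuinfo"); return 1; }

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "flags", 5) == 0) {
            char *list = strchr(line, ':');
            if (!list) break;
            for (char *tok = strtok(list + 1, " \t\n"); tok != NULL;
                 tok = strtok(NULL, " \t\n"))
                puts(tok);                       /* one flag per line */
            break;
        }
    }
    fclose(f);
    return 0;
}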

My biggest suspicion about where the performance anomaly came from is related to xgetbv1, the last flag in the list above. XGETBV is mentioned in the following excerpt from an Intel article on AVX instructions.


  https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions

Detecting Availability and Support
Detection of support for the four areas - Intel AVX, FMA, AES, and PCLMULQDQ - are similar and 
require similar steps consisting of checking for hardware and operating system support for 
the desired feature (see Table 1). These steps are (counting bits starting at bit 0):
   1. Verify that the operating system supports XGETBV using CPUID.1:ECX.OSXSAVE bit 27 = 1.
   2. At the same time, verify that CPUID.1:ECX bit 28=1 (Intel AVX supported) and/or bit 25=1
   (AES supported) and/or bit 12=1 (FMA supported) and/or bit 1=1 (PCLMULQDQ) are supported.
   3. Issue XGETBV, and verify that the feature-enabled mask at bits 1 and 2 are 11b 
   (XMM state and YMM state enabled by the operating system).
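
As a rough illustration of those three steps, here is a minimal C sketch (mine, not from the Intel article) for GCC or Clang on x86-64 Linux; the leaf and bit numbers are the ones quoted above:

/* Check CPU and OS support for AVX/FMA following the three steps above. */
#include <stdio.h>
#include <cpuid.h>      /* __get_cpuid() - GCC/Clang helper header */

static unsigned long long read_xcr0(void)
{
    unsigned int lo, hi;
    /* XGETBV with ECX = 0 reads XCR0, the OS feature-enabled mask. */
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 1;

    int osxsave = (ecx >> 27) & 1;   /* step 1: OS supports XGETBV  */
    int avx     = (ecx >> 28) & 1;   /* step 2: CPU supports AVX    */
    int fma     = (ecx >> 12) & 1;   /*         CPU supports FMA    */

    int os_avx = 0;
    if (osxsave)
        /* step 3: bits 1 and 2 of XCR0 (XMM and YMM state) must both be set */
        os_avx = (read_xcr0() & 0x6) == 0x6;

    printf("AVX: CPU %s, OS %s\n", avx ? "yes" : "no", os_avx ? "yes" : "no");
    printf("FMA: CPU %s\n", fma ? "yes" : "no");
    return 0;
}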

It may be the case that the Linpack executable from the older compiler build just did not recognise some of the features of the processor and took a less-than-optimal code path. Or, it could be related to some unintended power-state restriction. These are just guesses. If you have any insight as to why we saw such a large performance discrepancy then please feel free to add a comment. If I find anything else I’ll post it too.

Happy computing! –dbk