
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1068
Dr Donald Kinghorn (Scientific Computing Advisor)

Skylake-X 7800X vs Coffee Lake 8700K for compute (AVX512 vs AVX2) Linpack benchmark

Written on November 8, 2017 by Dr Donald Kinghorn

Of the many (too many!) desktop CPUs that Intel has released this year, two stand out: Skylake-X and Coffee Lake. In this post I'll look at the numerical computing performance of the Intel Core i7-7800X (Skylake-X) and the 8th generation Core i7-8700K (Coffee Lake).

For the impatient:
The 6-core Skylake-X beats the 6-core Coffee Lake by about 40% on Linpack.

I judge CPUs by the maximum raw compute performance they can deliver, and for that I can't think of anything better than the Linpack benchmark. Linpack optimized with the Intel Math Kernel Library (MKL) runs very near the peak theoretical floating point performance of Intel CPUs.

The Linpack benchmark computes the numerical solution of a system of linear equations. It involves matrix operations that are typical in mathematical and scientific applications. It runs efficiently in parallel on individual multi-core processors and can scale well on large compute clusters. It can be optimized to take advantage of special processing hardware in a CPU, like Intel's AVX vector units and Fused Multiply-Add (FMA) units. It is also the main benchmark used to rank the Top500 fastest supercomputers in the world. So, why not use it to see what modestly priced desktop processors can do!
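To make the idea concrete, here is a minimal sketch, assuming a Python environment with NumPy, of what Linpack measures: time the solution of a random dense system Ax = b and convert the time to GFLOPS using the HPL operation count of 2/3*n^3 + 2*n^2. This is not Intel's optimized Linpack or HPL. `np.linalg.solve` calls LAPACK's LU-based solver, so the rate you see depends on which BLAS/LAPACK library NumPy is linked against. The function name and problem size are illustrative only.

```python
import time

import numpy as np


def linpack_style_gflops(n: int, seed: int = 0) -> tuple[float, float]:
    """Solve a random dense system A x = b; return (GFLOPS, scaled residual)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    start = time.perf_counter()
    x = np.linalg.solve(a, b)  # LAPACK gesv: LU factorization with partial pivoting
    elapsed = time.perf_counter() - start

    # Operation count HPL uses to turn run time into a FLOP rate
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2

    # Scaled residual along the lines of HPL's check (HPL passes a run below 16)
    eps = np.finfo(np.float64).eps
    inf = np.inf
    r = np.linalg.norm(a @ x - b, inf)
    scale = np.linalg.norm(a, inf) * np.linalg.norm(x, inf) + np.linalg.norm(b, inf)
    residual = r / (eps * scale * n)
    return flops / elapsed / 1e9, residual


if __name__ == "__main__":
    gflops, resid = linpack_style_gflops(2000)
    print(f"{gflops:.1f} GFLOPS, scaled residual {resid:.3g}")
```

Intel's Linpack uses the same idea but with blocked, heavily tuned MKL kernels and much larger problem sizes, which is why it gets so close to peak.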

I recently did a blog post titled Intel Core-i9 7900X and 7980XE Skylake-X Linux Linpack Performance. I was stunned by the compute capability of the Skylake-X Core i9 processors. They range in core count from 10 to 18 and provide 44 PCIe lanes. They are great processors, but they are a little expensive. The Core i7 processors we are looking at in this post are more modest 6-core processors that provide fewer PCIe lanes, and both are priced about the same (under $400). They are tempting offerings for a desktop system that will occasionally get some serious compute workloads.


Brief comparison of the CPU's


Intel Core i7-7800X and i7-8700K Features

Features           i7-7800X               i7-8700K
Cores/Threads***   6/12                   6/12
Base Clock         3.5 GHz                3.7 GHz
Max Turbo          4.0 GHz                4.7 GHz
All Core           4.0 GHz                4.3 GHz*
Cache              8.25 MB                12 MB
TDP                140 W                  95 W
Max Mem            128 GB (512 GB Reg**)  64 GB
Mem Channels       4                      2
Max PCIe lanes     28                     16
Vector Unit        AVX-512                AVX2

The features that will have the biggest impact on compute performance are core clocks, the AVX vector unit, and cache. The high clock speeds, fast memory, large cache, and low power consumption of the Coffee Lake processor are very compelling. However, the last item in the table above, the vector unit (AVX-512 vs AVX2), can have a significant impact on numerical compute performance.

* Note: Clock frequency observed during the 6-core benchmark run.
** Registered memory can be used on some X299 motherboards, allowing up to 512 GB of memory.
*** Hyper-threading is basically useless for this workload. In general, however, I recommend checking whether it gives a performance benefit.
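The link between the vector unit and raw FLOPS can be sketched with the usual back-of-the-envelope peak formula: cores x clock x FLOPs per cycle, where each fused multiply-add counts as two double-precision FLOPs per 64-bit vector lane. AVX2 registers hold 4 doubles and AVX-512 registers hold 8, so at the same clock and the same number of FMA units, AVX-512 doubles the per-core peak. The number of 512-bit FMA units differs between Skylake-X/Xeon models and is not stated in this post, so it is left as a parameter. The example only plugs in the i7-8700K (two 256-bit FMA units per core) at the 4.3 GHz all-core clock from the table. The function name is hypothetical, and real clocks under AVX load can differ from the all-core turbo, so treat the result as an estimate.

```python
def peak_gflops(cores: int, ghz: float, fma_units: int, vector_bits: int) -> float:
    """Theoretical double-precision peak in GFLOPS.

    Each FMA does a multiply and an add (2 FLOPs) on every 64-bit lane.
    fma_units is the number of vector FMA units per core (an input, not looked up).
    """
    lanes = vector_bits // 64            # doubles per vector register: 4 for AVX2, 8 for AVX-512
    flops_per_cycle = fma_units * lanes * 2
    return cores * ghz * flops_per_cycle


# i7-8700K: AVX2 (256-bit), two FMA units per core, 4.3 GHz observed all-core clock
print(f"{peak_gflops(cores=6, ghz=4.3, fma_units=2, vector_bits=256):.1f} GFLOPS")
```

This prints 412.8 GFLOPS, so the measured 361.3 GFLOPS for the i7-8700K on 6 cores is about 87.5% of its theoretical peak.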


Hardware:

  • Intel Core i7 7800X 3.5GHz Six Core

    • EVGA X299 Micro mATX Motherboard

    • 64 GB DDR4-2400 Memory

  • Intel Core i7 8700K 3.7GHz Six Core

    • Gigabyte Z370 Motherboard

    • 64 GB DDR4-2600 Memory

Software:

  • Ubuntu 16.04 kernel 4.13

  • Intel MKL 2018 (Math Kernel Library)

  • Intel optimized Linpack Benchmark

I'm running Linux for this testing, but there is no reason to expect the same workloads to perform any differently on Windows 10.


Results

The table below and the following chart make it clear that, for this type of workload, the AVX-512 vector unit on the Skylake-X processor matters significantly more than the higher clock frequencies of Coffee Lake.


Intel Core i7-7800X and 8700K Linpack GFLOPS

CPU Cores  i7-7800X GFLOPS  i7-8700K GFLOPS
6          500.0            361.3
4          362.1            252.8
2          199.8            132.2
1          100.5            70.0

The following chart shows the significantly better performance of the Core i7-7800X over the i7-8700K when running the Linpack benchmark on 1, 2, 4, and 6 cores.

[Chart: CPU Linpack GFLOPS]
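The "about 40%" figure can be checked directly against the results table. This small sketch (function and variable names are hypothetical) computes the i7-7800X's advantage at each core count, plus a simple scaling efficiency: measured GFLOPS divided by cores times the single-core result.

```python
# Linpack GFLOPS from the results table, keyed by number of cores used
I7_7800X = {1: 100.5, 2: 199.8, 4: 362.1, 6: 500.0}
I7_8700K = {1: 70.0, 2: 132.2, 4: 252.8, 6: 361.3}


def speedup(a: dict[int, float], b: dict[int, float]) -> dict[int, float]:
    """How many times faster a is than b at each core count."""
    return {cores: a[cores] / b[cores] for cores in sorted(a)}


def scaling_efficiency(results: dict[int, float]) -> dict[int, float]:
    """Measured GFLOPS divided by (cores x single-core GFLOPS); 1.0 is perfect scaling."""
    one_core = results[1]
    return {cores: gf / (cores * one_core) for cores, gf in sorted(results.items())}


if __name__ == "__main__":
    for cores, ratio in speedup(I7_7800X, I7_8700K).items():
        print(f"{cores} cores: i7-7800X is {100 * (ratio - 1):.0f}% faster")
    for name, results in (("i7-7800X", I7_7800X), ("i7-8700K", I7_8700K)):
        print(f"{name} 6-core scaling efficiency: {scaling_efficiency(results)[6]:.2f}")
```

From the table, the 7800X's lead ranges from about 38% on 6 cores to about 51% on 2 cores, and both chips reach roughly 83% (7800X) and 86% (8700K) of ideal linear scaling at 6 cores.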

Which to choose?

It is not really as clear-cut as the results above might suggest! The Coffee Lake processor has a high clock frequency and fast memory, and even with just a desktop install it "feels" fast. We have also done other testing that shows that it is indeed fast! Many applications perform very well on Coffee Lake. The Skylake-X processors are probably better suited for workstation applications. They are well suited for researchers who are using applications or writing code that make heavy use of numerical linear algebra. Linking with Intel MKL for matrix operations, or using the Intel compilers' auto (or manual) vectorization tools, could give a significant speed boost to programs running on these new Skylake (or Xeon Purley) processors.

For me personally? Skylake-X! But I want the Core i9 versions with higher core counts and more PCIe lanes. I may be getting myself a nice Christmas present this year :-)

Happy computing --dbk

Tags: Intel CPU, Skylake-X, Coffee Lake, Linpack
Chivster

Intel Core i7-7800X and 8700K Linpack GFLOPS

CPU Cores i7-7900X <--- typo GFLOPS i7-8700K GFLOPS

6 500.0 361.3
4 362.1 252.8
2 199.8 132.2
1 100.5 70.0

The following chart show the significantly better performance
of the core i7-7800X over the i7-8700K when using 1, 2, 4, and 6 cores
running the Linpack benchmark.

Posted on 2017-11-29 07:07:03
Donald Kinghorn

Thanks, I fixed it. I had been testing the 7900X too; I really like the Core i9 processors. The 7980XE is one of the best processors I've worked with so far. I really like the high core clocks and AVX512!

Posted on 2018-01-11 05:10:27
Chivster

You're welcome. Still, the premium price for the few PCIe lanes you get is off-putting when I want to run SLI x16/x16 (and I have two NVMe PCIe drives and want to add a third)... I'm almost ready to sacrifice a few fps and go AMD, since in 4K gaming giving up the high core clocks doesn't cost that much.

Posted on 2018-01-14 08:38:07

Dr. Kinghorn, as usual, thank you for taking the time to write these powerful articles.
Are there any plans to include custom LTSpice (or PSPICE) and MATLAB benchmarks? Some circuit simulations with SPICE programs can be serious stressors.

Posted on 2018-01-21 22:00:24
Donald Kinghorn

I agree those would be very good benchmarks! I haven't run a SPICE simulation in ages, and when I did it was only small circuits (but I have been very tempted to try a vacuum-tube guitar amp simulation :-). I haven't looked at any of the simulation programs with respect to performance. However, anything linked to MKL should see a nice performance boost. It would even be worth recompiling to be sure that an up-to-date MKL with AVX512 gets linked in.

I have been trying to get MathWorks to give me a testing copy of Matlab for years! No luck! I really like Matlab but I can't afford it. (Also, I'm mostly using Python with numpy for those use cases now ... and I'm anxious to see version 1 of Julia!) One great thing about Matlab is that it includes a copy of MKL by default. I also know that you can replace the included copy with a newer version (with some effort).

Having said all of that, I probably won't be doing any of that testing, even though I would like to and know that it would be useful to the community. I'm spending most of my time now working on machine learning, DNNs, etc., and really enjoying that! It would take me several weeks to set up any meaningful tests and I don't have that time available. ... It would be REALLY interesting to set up a "reinforcement learning" training model on a SPICE simulation! Best wishes --Don

Posted on 2018-01-22 20:54:05
thetrystero

Will the difference be as large with TensorFlow? I'm guessing probably not, since that's more GPU-intensive? Or am I wrong?

Posted on 2018-05-14 04:32:49
Donald Kinghorn

That is a really good question! TensorFlow has really good GPU acceleration, and it generally scales well across multiple GPUs too. So, if you are running TF, you probably want to take advantage of that. However, there is still some overhead on the CPU, and you may not always want to use the GPU ... maybe you need more memory, or you are running smaller dev jobs and don't want to wait on the TF GPU startup overhead, etc.

Also, if you are using Anaconda Python, it has libraries like numpy linked against Intel MKL, and AVX512 does make a nice difference there.
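A quick way to see what a given NumPy install delivers is to check which BLAS/LAPACK library it reports and time a large double-precision matrix multiply, which needs about 2*n^3 FLOPs. The sketch below uses only standard NumPy calls (`np.show_config`, `np.matmul`); the function name and matrix size are illustrative. Running the same script under an MKL-linked NumPy (such as Anaconda's) and under a build linked against a different BLAS is an easy way to see the difference.

```python
import time

import numpy as np


def matmul_gflops(n: int = 2048, repeats: int = 3) -> float:
    """Best-of-`repeats` GFLOPS for an n x n float64 matrix multiply (~2*n^3 FLOPs)."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    c = np.empty((n, n))  # preallocated so allocation is not part of the timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.matmul(a, b, out=c)  # dispatched to the BLAS dgemm NumPy was built against
        best = min(best, time.perf_counter() - start)
    return 2.0 * n**3 / best / 1e9


if __name__ == "__main__":
    np.show_config()  # prints the BLAS/LAPACK libraries (e.g. MKL or OpenBLAS) NumPy uses
    print(f"{matmul_gflops():.1f} GFLOPS")
```

If NumPy is linked against MKL, setting the MKL_ENABLE_INSTRUCTIONS environment variable (for example to AVX2) before starting Python should restrict MKL to that instruction set, which is one way to compare AVX2 and AVX-512 code paths on the same Skylake-X machine.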

The latest TensorFlow is, finally, built with AVX (not AVX2 or AVX512). I did some testing on compiling TensorFlow from source; that's in these posts:
https://www.pugetsystems.co...
https://www.pugetsystems.co...
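To check which of these vector instruction sets a Linux machine actually supports (for example, before deciding whether an AVX512 build is worth compiling), a small sketch can read the CPU flags the kernel reports in /proc/cpuinfo. The flag names (avx, avx2, fma, avx512f, ...) are the kernel's own; the function name and the particular selection of flags are hypothetical choices for illustration. An i7-8700K should report avx, avx2, and fma but no avx512 flags, while a Skylake-X part such as the i7-7800X should also list avx512f.

```python
from pathlib import Path

# SIMD-related flag names as they appear in the Linux /proc/cpuinfo "flags" line
SIMD_FLAGS = ("avx", "avx2", "fma", "avx512f", "avx512dq", "avx512cd", "avx512bw", "avx512vl")


def simd_flags(cpuinfo_text: str) -> set[str]:
    """Return the SIMD flags listed on the first 'flags' line (empty set if none)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return {flag for flag in SIMD_FLAGS if flag in present}
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux x86 only
    if cpuinfo.exists():
        print(sorted(simd_flags(cpuinfo.read_text())))
```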

I saw around a 2.5x speedup with my build linked with MKL, including AVX512.

So, it's nice to have the new Skylake-X (W, SP) CPUs, but ANY newer CPU will be pretty good in general, and for something like TensorFlow, when you are doing more demanding work, GPU acceleration is the way to go. Almost any 9xx or later NVIDIA GPU will give much better CNN, DNN, RNN ... training performance. So something like an 8700K and a 1080Ti would be a really nice rig for machine learning work (and a great gaming platform too :-)

It had been 5 years since I updated my personal systems at home and the Skylake-X(W) testing I did convinced me it was time to upgrade. I went for a really nice setup with a Xeon-W 2175 (14-core) 128 GB mem and nice storage along with a 1080Ti. It is really nice!! The one thing I may still do is add a Titan V. I've been testing those cards and they are very good. I'm becoming convinced that they are worth the extra cost. ... I may wait to see what the next generation of GeForce cards look like though, or the next Titan update. For now I'm pretty happy and I have at least some access to multiple Titan V's now and then :-)

Posted on 2018-05-14 16:28:59
Guido

Hi Donald,

Have you tried benchmarking with different memory channel configurations? I'm wondering if you get more throughput with all 4 slots filled on the Skylake vs. just 2 memory slots filled. On the i7-8700K CPU + motherboard I get almost double the throughput on my computations if I use 2 x 16GB memory vs. just one 16GB memory stick.

Posted on 2018-05-31 23:19:21

I can't speak for Don directly, but in general we benchmark stuff with all the memory channels in use - though not necessarily maxed out on capacity. For example, with the 8700K you mentioned, it is dual-channel... so we would only test it with 2 or 4 sticks of memory in the proper slots. Anything less (just one DIMM, for example) could lead to reduced performance by limiting the bandwidth between the CPU and RAM. That sounds like what you have observed in the situation you described.
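The bandwidth point can be put in numbers with the theoretical peak for DDR4: each channel is 64 bits (8 bytes) wide, so peak bytes per second is channels x 8 x transfers per second. This sketch (function name hypothetical) ignores real-world efficiency, which is always below peak; it just shows why populating only one of two channels roughly halves the bandwidth available to the CPU.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: float, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Each DDR4 channel is 64 bits (8 bytes) wide, so
    bytes/s = channels x 8 x (mega_transfers x 1e6); dividing by 1e9 gives GB/s.
    """
    return channels * bus_bytes * mega_transfers / 1000.0


# Memory configurations from this post's hardware list
print(peak_bandwidth_gbs(4, 2400))  # i7-7800X, quad-channel DDR4-2400
print(peak_bandwidth_gbs(2, 2600))  # i7-8700K, dual-channel DDR4-2600
print(peak_bandwidth_gbs(1, 2600))  # same memory with only one channel populated
```

That works out to 76.8 GB/s, 41.6 GB/s, and 20.8 GB/s respectively, in line with the roughly 2x difference observed between one and two DIMMs on the 8700K.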

Posted on 2018-05-31 23:24:45
Donald Kinghorn

That is a good observation, but with something like Linpack I really don't see much difference between half-full and full slots, regardless of the number of memory channels. If there is enough memory throughput for the caches to keep the AVX registers full, you will get near-max floating point performance for the processor.

For benchmarking with something like Linpack, the memory effect that seems to make the most difference is simply how much of it you have. You get a pretty good idea of a processor's performance at a problem size that uses around 16GB, but to get the max for a processor you get better results when you go to a problem size that uses around 90% of total memory, and you have lots of it! If you run a problem size that uses around 450GB of 512GB, you will see better results (10-20%).
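The "around 90% of total memory" rule translates directly into a Linpack problem size: the N x N matrix of 8-byte doubles takes 8*N^2 bytes, so N is about sqrt(0.9 x memory / 8). Here is a small sketch of that calculation, assuming the result is rounded down to a multiple of the HPL block size NB. The NB = 256 default is a placeholder, since NB is a tuning parameter you would normally adjust per CPU, and the sketch ignores memory used by the OS and other processes. The function name is hypothetical.

```python
import math

GIB = 2**30


def hpl_problem_size(mem_bytes: float, fraction: float = 0.9, nb: int = 256) -> int:
    """Largest N, rounded down to a multiple of nb, whose N x N float64 matrix
    fits in `fraction` of mem_bytes (8 bytes per matrix element)."""
    n = math.isqrt(int(fraction * mem_bytes / 8))
    return (n // nb) * nb


if __name__ == "__main__":
    for gib in (64, 512):
        n = hpl_problem_size(gib * GIB)
        print(f"{gib:>3} GiB -> N = {n} (matrix uses {8 * n * n / GIB:.1f} GiB)")
```

For a 64 GiB machine like the test systems in this post, this gives N = 87808, with the matrix taking about 57.4 GiB.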

Linpack performance is the standard performance measure for supercomputers. People go to a lot of trouble to get optimal results when they are trying to get on the Top500 list ... it's kind of a game. A site that I've always liked is the "calculator" at
http://hpl-calculator.sourc...
You can enter numbers for your (1-node) system there and get an estimated score. You can also look up which Top500 list, by date, your system would have made it onto. It's really too simplistic for recent processors, though, since they have multiple clock speeds and their operations per cycle are difficult to estimate ... also, these days it is all about GPU acceleration! Best wishes --Don

Posted on 2018-06-01 16:26:24
Klamer Schutte

So did you indeed use all four memory channels on the Skylake 7800X? I am intrigued by the not-so-good scaling of its numbers to 6 cores and wonder whether that is RAM-bandwidth limited. The Coffee Lake processor is very close to its theoretical maximum, while the Skylake seems to be under... What do you expect to be the reason for that?

Posted on 2018-12-13 20:10:28
Donald Kinghorn

The 7800X scaled very well, so I'm confused by your comment. That's a great result on 6 cores. I can say one thing, though: when this was done over a year ago, everything was very new and there was some "funny business" with BIOSes from the board makers. Any testing done around this time should be taken with a "grain of salt". If you want to see more testing where the AVX512 vector units shine, look at the recent testing I did comparing against the Threadripper 2990WX. I built an optimized HPL Linpack for AMD, but the Intel Xeon-W 2175 did much better.
https://www.pugetsystems.co...
Cheers mate!

Posted on 2018-12-15 03:22:31