Dr Donald Kinghorn (Scientific Computing Advisor)

Intel Core-i9 7900X and 7980XE Skylake-X Linux Linpack Performance

Written on October 10, 2017 by Dr Donald Kinghorn

The Intel Core i9 7900X and 7980XE are very good desktop processors for mathematical computing workloads. This post is a short listing of results for the Linpack benchmark, which is still my personal favorite CPU performance metric.

These Skylake-X Core i9 "desktop" processors benefit from having Intel's latest vectorization hardware, AVX512. AVX512 doubles the vector width over AVX2, from 256 bits to 512 bits. This feature is one of the most significant technologies in Intel's new high-end "scalable processor" Xeon architecture, a.k.a. Purley, a.k.a. Skylake-SP. It is nice to see that this technology is included in these "desktop" CPUs. There is also a Xeon Skylake-W processor line that is very similar to the Skylake-X "desktop" processors.
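
If you want to confirm what the kernel sees, the AVX512 feature flags can be pulled straight from /proc/cpuinfo. This is just a quick sketch; the flag set in the comment is what Skylake-X parts typically report, not output captured from these test systems.

grep -o 'avx512[a-z]*' /proc/cpuinfo | sort -u
# Skylake-X typically reports: avx512bw avx512cd avx512dq avx512f avx512vl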

The systems I tested were based on the Puget Systems "Peak Mini" compact HPC workstations. The relevant specs are:

Hardware:

  • Intel Core i9 7900X 3.3GHz Ten Core

    • Base clock 3.3GHz

    • All-Core-Turbo 4.0GHz

    • Max Turbo 4.5GHz

  • Intel Core i9 7980XE 2.6GHz Eighteen Core

    • Base clock 2.6GHz

    • All-Core-Turbo 3.4GHz

    • Max Turbo 4.4GHz

  • EVGA X299 Micro mATX Motherboard

  • 64 GB DDR4-2400 Memory

[I also had 2 NVIDIA Titan Xp GPUs in the 7980XE system but that's another story :-) ]

Software:

  • Ubuntu 16.04 with kernels 4.4.0, 4.11.0, and 4.13 (performance was identical on all kernels, but see the note below)

  • Intel MKL 2018 (Math Kernel Library)

  • Intel optimized Linpack Benchmark (see the run sketch below)
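
For reference, here is a sketch of how the Linpack benchmark that ships with MKL is typically run. The directory and file names below are the usual ones for MKL 2018, but they may differ on your install.

cd /opt/intel/mkl/benchmarks/linpack     # typical MKL 2018 location; adjust for your install

# Full sweep over the problem sizes defined in lininput_xeon64:
./runme_xeon64

# Or run the binary directly with a chosen thread count, e.g. to load 10 cores:
OMP_NUM_THREADS=10 ./xlinpack_xeon64 lininput_xeon64

Editing lininput_xeon64 is how you change the matrix sizes, leading dimensions, and number of trials.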

Note: The CPU frequency reported in /proc/cpuinfo was always either the base clock or the 1.2GHz low-power state. I tested different kernels but never saw any difference. I did install linux-tools and used "cpupower", which did report correct CPU core frequencies extracted from the "Intel P-state" information. The performance numbers presented here are consistent with the listed single-core and all-core "Turbo" frequencies. Output showing this is included at the end of this post.


Linpack Benchmark Results

Now for what you came here to see! I was unable to find any reports of Linpack numbers for these processors in a Google search. The performance is very good! I'm not really doing a comparative discussion in this post, but I will include one number from a dual Xeon 2690v4 Broadwell system that I recently ran the same benchmark on, so you can see how well these "desktop" Core i9 processors do under heavy mathematical compute load.
GFLOPS = "Billions of Floating Point Operations per Second"

  • Intel Core i9 7900X -- 638.9 GFLOPS

  • Intel Core i9 7980XE -- 977.0 GFLOPS


  • Intel Dual Xeon 2690v4 -- 1123 GFLOPS

Note: The dual Xeon 2690v4 system has 28 "real" cores and had 512GB of memory. The Linpack number for that system was from a very large problem size of 100,000 equations, using nearly all of that 512GB of memory! A more typical result is around 980 GFLOPS.


These Core i9 processors were remarkably good on this standard HPC benchmark. They were both near 100 GFLOPS on a single core, thanks to the high CPU clock boost from "Max-Turbo". The CPU clocks on Intel processors decrease from the Max-Turbo frequency down to the All-Core-Turbo frequency as power draw increases, i.e. as more cores are loaded. In the table below you can see that the 18-core 7980XE did exceptionally well at 8 and 10 cores. That's because it is still operating near its maximum frequency at that point, while the 7900X has already dropped to its all-core frequency.
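
For rough context, theoretical peak double precision throughput is cores x clock x FLOPs-per-cycle, and these Skylake-X parts have two AVX512 FMA units per core, i.e. 32 double precision FLOPs per cycle. The helper below is only a back-of-the-envelope sketch; the sustained AVX512 clock under load is what really matters, and it sits below the listed Turbo numbers.

# Hypothetical helper, not part of the benchmark: peak DP GFLOPS = cores * GHz * 32
peak_gflops () {  # usage: peak_gflops <cores> <GHz>
    awk -v c="$1" -v f="$2" 'BEGIN { printf "%.0f\n", c * f * 32 }'
}
peak_gflops 1 4.4     # ~141 GFLOPS peak for one 7980XE core at Max Turbo
peak_gflops 18 3.35   # ~1930 GFLOPS peak at the ~3.35GHz observed with cpupower

Measured Linpack always lands below these theoretical peaks, especially for quick runs at modest problem sizes.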


Intel Core i9 7900X and 7980XE Linpack GFLOPS

CPU Cores   i9 7900X GFLOPS   i9 7980XE GFLOPS
   18             --               977.0
   16             --               954.5
   10            638.9             773.9
    8            576.0             648.3
    4            341.8             347.6
    2            191.0             185.7
    1             97.5              95.2


The following plot gives a visual representation of the Linpack performance of these outstanding processors.

[Plot: CPU Linpack GFLOPS vs. number of cores for the Core i9 7900X and 7980XE]


Incorrectly reported Linux ACPI CPU frequency states

The CPU frequency values in /proc/cpuinfo and those reported by cpufreq-info were incorrect, as seen in the cpufreq-info output below from the Core i9 7980XE:

analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 1.20 GHz - 2.60 GHz
  available frequency steps: 2.60 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.30 GHz, 2.20 GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40 GHz, 1.30 GHz, 1.20 GHz

Using "cpupower" which access the intel_pstate does seem to report correct frequencies but they are very dynamic under load.

Following is output from running the Linpack benchmark with OMP_NUM_THREADS=1. This is consistent with the listed Max Turbo frequency.

kinghorn@mini:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq 
   0| 99.77|  0.23|  4357

Below is output from the benchmark run with OMP_NUM_THREADS=18. This is consistent with the All-Core-Turbo frequency.

kinghorn@mini:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq 
   5| 99.99|  0.01|  3355
   8| 99.99|  0.01|  3354
  11| 99.99|  0.01|  3349
  10| 99.99|  0.01|  3349
   1| 99.99|  0.01|  3348
   3| 99.99|  0.01|  3342
   2| 99.99|  0.01|  3341
   7| 99.99|  0.01|  3308
   6| 99.99|  0.01|  3308
  12| 99.98|  0.02|  3363
   9| 99.98|  0.02|  3354
  15| 99.98|  0.02|  3353
  14| 99.98|  0.02|  3352
   0| 99.98|  0.02|  3347
  16| 99.97|  0.03|  3355
  17| 99.83|  0.17|  3355
  13| 99.76|  0.24|  3363
   4| 99.72|  0.28|  3354

Happy computing! --dbk

Tags: Intel Core i9, 7900X, 7980XE, Skylake-X, Linpack, Linux, HPC
Brad Jascob

I'm curious under what conditions you tested your systems. My 7940X is giving as high as 1035 GFlops using the pre-compiled l_mklb_p_2018.0.006 from Intel (running their runme_xeon64 script, which tests multiple array sizes) under Linux 1710. LDAs over 10,000 are giving between 900 and 1035 GFlops. Intel advertised that the 7980XE was their first TFlops-class consumer processor, but I was surprised to see my 14-core hitting these numbers.
BTW, on my system, looking at /sys/devices/system/cpu/cpuXX/cpufreq/scaling_cur_freq shows 3.8GHz through most of the tests, though it did sometimes drop back to the base clock of 3.1GHz for a few seconds. The 3.8GHz is its max 14-core turbo, whereas the 7980XE has a 3.4GHz max turbo when all 18 cores are running. I'm sort of curious if the 18-core is actually not performing any better than the 14-core because it's generally running at a lower clock rate.

Posted on 2017-11-01 03:04:48
Donald Kinghorn

Hi Brad, sorry, I just saw your comment! Your performance numbers are more representative of what these processors will do. The test runs I did were on early release samples (including the motherboards), and I believe the clocks were not quite right. These results were just quick job runs with problem sizes less than 40,000. I have done other testing with larger problem sizes and gotten results over 1 TFLOPS, as you have observed.

Your 14-core 7940X is the best processor of the bunch! The 18-core does drop to a lower clock. These processors are crazy with the clocks because of the AVX-512 vector units. Check out the following post. There is a link at the bottom to the spec sheet that has all of the turbo clocks. There are 4 clocks!
Best wishes --Don

Intel Scalable Processors Xeon Skylake-SP (Purley) Buyers Guide https://www.pugetsystems.co...

Posted on 2017-12-12 05:23:40
Tugrul_512bit

Can Intel's own OpenCL driver optimize a kernel to use AVX512 up to 70% or similar efficiency compared to MKL?

Also, the Xeon Silver 4114 has just 10 cores with only 1x AVX512 unit per core and a 1400MHz turbo (for AVX512). Does this mean its gflops value would be 100-200 Gflops? Because when you scale 1400MHz vs 3.4GHz, that's already a 2.4x performance difference, adding 1x AVX512 vs 2x AVX512 gives another 2.0x performance, and having 10 cores vs 18 cores would mean a 1.8x performance difference. 1.8 * 2.0 * 2.4 = 8.64 times the gflops. 1 Tflops / 8.64 = 115 gflops. Is this low performance for a 10-core new Intel chip?

Maybe just AVX2 is much faster (2.2GHz turbo) at 200 Gflop/s on Silver Xeons, since both AVX and AVX512 use the same pipelines (fused) and there is no extra AVX512 unit until the Gold 6126?

I have an FX-8150 at 3.7GHz (no turbo), and on Windows the aida64 GPGPU benchmark says it has 118 gflop/s for double precision. Maybe aida64's benchmarking does not do any meaningful real-world computations and just gets theoretical values?

When comparing the Silver 4114 vs Bulldozer: 2x AVX256 pipelines per core vs 1x SSE128 pipeline per core, so each core of the 4114 is 4x a Bulldozer core at the same frequency. This would mean one 4114 CPU is like a 40-core Bulldozer, which means 5 times the performance --> 118*5 = nearly 600 gflop/s double precision for a single 4114. (1.2 Tflop/s for single precision?)

I am going to update my system for my life science work, but I can't decide between Xeon Silver 2x 4114 and EPYC 2x 7281 based on this "available gflops" for OpenCL acceleration.

Posted on 2018-03-03 19:38:07
Donald Kinghorn

Welcome to the wonderful new world of processor chaos! You have a lot of questions and observations ... bottom line is that it's hard to know for sure without running code! How all of the clocks interplay is hard to predict. The big thing to keep in mind for AVX512 is that the clock is significantly slower when it is in use. That can make predictions difficult. It comes down to whether the depth of AVX512 makes up for the slower overall core clock. It's a mess! There are 4 different core clocks depending on how the processor is executing code! The plots I have in the Xeon Scalable post should give a reasonable idea of what different workloads will do and the relative performance of the processors.

I don't know anything about openCL performance (kind of wish I did ...)

aida64 is a good stress tester for sure, but I'm not so sure that it's good for performance comparison, especially with CPUs being as complicated as they are now.

I am just now starting to mess with multi-socket Xeon Scalable. As a teaser, the insanely expensive 8180 in a dual socket config gave me around 2.2 TFLOPS double precision Linpack using MKL.

Those Xeons with only 1 AVX unit don't make any sense to me. I'm guessing that is just a way to salvage defective silicon, i.e. disable an AVX unit and sell it ... but I doubt that having 2 of them actually doubles performance of any real code. It will likely be some fraction of that. In any case, my personal preference would be to go with the single socket Xeon-W instead, or the Skylake-X if you don't need the extra memory capacity and ECC. (It's a shame that the X299 will not support Reg memory any more after the Spectre and Meltdown patches.)
The best thing about those 4114s is going to be the price, but I'm not sure if they are a "good deal" or not.

EPYC is a big question mark for me. It is taking a very long time for AMD to get samples to us! I have hope for the processor but I'm not too optimistic anymore. At SC17 there was a lot of interest because everyone was unhappy with Intel for the "Purley" mess and pricing. (There was a lot of interest in the serious ARM procs too!) A lot of big players were doing testing on EPYC (that's part of the reason they have been hard to get a hold of). In the end I think Microsoft on Azure is the only cloud provider that put any of them online. I haven't tested with those on Azure, but if AMD doesn't get some stuff to us soon I might do that.

I wish I had better advice for you! I will test EPYC and I hope to have some Xeon Scalable testing posted soon, but in the meantime I don't envy anyone trying to decide on what to put in a system right now. However, in general the new processors are great! I just love the new Xeon-W and Skylake-X; I think they are a bargain for performance/cost.

Best wishes --Don

Posted on 2018-03-05 18:03:10
Tugrul_512bit

Thank you for sharing time,

I will be looking for your EPYC and other Xeon Scalable tests too. Guessing the Silver line has no cache bottlenecking issues at least, which can be 90%-100% efficient (or scaling) even when there is some cache dependency? (Or does cache drop with turbo too?)

I checked some Geekbench 4 CPU tests and 1-AVX-capable Xeons had 40-60 Gflops on SGEMM while the i9 series had around 170 Gflops on SGEMM, all on a single core. There is a 4x performance multiplier between 2-AVX pipelined cores and the others, per core. Looking at the results, this makes the Silvers "look" more scalable :) compared to 2-AVX versions such as the W series you mentioned.

I also checked some Gold 6126 benchmark results (the cheapest many-core, dual-AVX-unit scalable part), which had relatively high per-core performance like the W series; for $1700, with nearly 2.5x the performance of the $700 Silver 4114, it looks much more preferable than any 2x Silver setup. If someone had enough money to buy 2x Silver, why not Gold? If budget is a problem or 1 CPU is enough, the 2155W can be fastest then (with other reasons too, as you said).

But if I'm going to have 4x CUDA/OpenCL GPUs + CPU(s) working on the same workload, and each GPU needs 3-4 threads for communication/control, then 2x Silvers can still have excess gflops remaining ( 12C x (40 gflops/C) ) while a single Gold/W retains only a smaller fraction ( 4C x (170 gflops/C) or 2C x (200 gflops/C) ) of its peak value. I mean, the Gold and W series would waste more precious gflops while driving 4x GPUs, or they'd need to work partially serially. I wish we could mix 1x EPYC and 1x Gold and have the OS take care of binding the GPUs to the EPYC and the computing to the Gold.

Also, if the 1st thread of each core does GPU control while the second thread does computing, would this slow GPU feeding (low turbo caused by AVX512 on the other thread) or stutter the AVX512 pipeline because the other thread is using/leeching all the other resources at a higher frequency?

I heard that if I mix a Silver with a Gold, then the Gold gets downclocked to match the Silver, and only if both have the same core counts. This could be cool if there wasn't that frequency limiting factor. Gold computes, Silver drives the GPUs, all concurrently (with increased programming effort of course, or maybe the OS is better than us at doing this?).

Lastly, can the Xeon W2155 be overclocked and still keep its stability as well as a Bronze or Gold scalable version? It looks like, if overclocked, it would easily match 3x 4114 Silvers. I also heard that pro CPUs overclock better than desktop CPUs, but I'm not sure if stability remains. (I mean, higher than the i9's stability, to keep running non-stop for one week at 90% load capacity.)

I'm sorry for posting long messages, lots of assumptions and grammatical errors.

Respectfully,

Tugrul.

Posted on 2018-03-05 18:52:32
Kivanc KARANIS

2690-v4 does not have 28 cores but 14
https://ark.intel.com/produ...

Posted on 2018-05-03 01:48:45

I believe Dr Kinghorn had two of those Xeon CPUs in the system he tested, so 14 x 2 = 28 total cores. The key on the graph is not clear on this point, but we almost exclusively use / sell those Xeons in pairs.

Posted on 2018-05-03 16:18:17
Donald Kinghorn

Yes, that is a dual socket system that was tested, so it is 2 x 14 cores ... These new Skylake X, W, and SP processors have great performance potential. It's surprising to me that you can basically get the performance of a last-gen dual-socket system with a single-socket system.

There was also this note in the post (but I didn't mention it in the chart annotations):

Note: The dual Xeon 2690v4 system has 28 "real" cores and had 512GB of memory. The Linpack number for that system was from a very large problem size of 100,000 equations, using nearly all of that 512GB of memory! A more typical result is around 980 GFLOPS.

Posted on 2018-05-04 00:34:37
Kivanc KARANIS

Yes yes yes, very sorry for that. I realized the note AFTER posting and deleted it asap, but unfortunately it stayed somehow.

Posted on 2018-05-04 07:06:12
Kivanc KARANIS

I don't want to be annoying (especially after an embarrassing intro) but I'd like to share my experience with the 7980XE.

This article is based on "Linux Linpack Performance" and what is said here is perfectly true, but benchmarks bother me a bit.
I love benchmarks, because they give you a common platform to discuss things.
But I hate benchmarks, because it's like solving a physics problem in a frictionless environment. It is never real.

The first difference between the i9-7980XE and Xeon is scalability. You cannot scale to dual or quad i9s, but you can scale Xeons.
The second difference is overclocking.
If your workload can be handled without scaling, you don't need a Xeon. This is the truth, if you can handle every overclocking step with care.

Currently we are using WRF to forecast regional weather on CPU (without GPU) and my mission is to beat a target Xeon based system with a desktop.
WRF does a lot of double precision floating point math in Fortran code, and that goes to AVX512, but it has many more packages to execute in pre-processing and post-processing and not all of them trigger AVX512. (I'll come to that)

The hardware I've built is an ASUS Rampage VI Extreme (a gamer thing), i9-7980XE, 128GB 3200MHz Dominator Platinum, a Corsair HX1200i, a closed-loop H150i cooler, 2x 40mm Noctua fans on the VRM, and misc fans in the case for proper airflow.
The reason I've chosen a flagship gamer board is the ease of overclocking, since I'd be pushing everything to the limits.

Just a small note: if you are willing to push the limits, you have to use a custom compiled kernel. Otherwise, you may not even be able to get correct readings of CPU frequencies from /proc or cpufreq-info. Furthermore, compiler flags are your friends for getting the most out of your source code, so you have to focus on them a lot. Using PGI was needed in my case because Fortran code compiles very nicely there. (NVIDIA has worked hard since CUDA 1.0)

The important thing on my board was the ability to scale down the clock multiplier for AVX512, and that was all that mattered.
I could achieve a stable 4.6GHz standard and 4.3GHz AVX512 clock under standard workload conditions and accomplish WRF runs faster than my target.
I could not finish a Linpack run.
I knew that with more tuning I could achieve that also, but I did not build this system for Linpack performance.

When it comes to overclocking, your focus is to manage the speed and amount of increased voltage sent to components while keeping temperatures under control.
This is why I do not like benchmarks. Benchmarks stress everything all the time, but in real life you generally do not do that; the CPU has time to breathe while writing to disk, parsing input, etc.

As a conclusion,
the i9-7980XE is hard to judge from benchmarks. Since it is an overclocking CPU and limited to 1 processor, you have to buy one, optimize your kernel and compiler flags, and keep temps low. It will crash, and you'll tune it again. BUT you will catch it and it will work rock solid.
Do not confuse yourself comparing benchmark results, because they will not reflect your actual usage for this CPU; it behaves very differently depending on whether it goes into AVX512 or not.

I couldn't get into delidding on this first CPU yet, but it is worth doing; after making this one work stable, the next one will get delidded.

Xeons, ah yes, I love them. But if your workload does not need scaling, you actually do beat them with the 7980XE with proper configuration.

Posted on 2018-05-07 20:58:29