

Read this article at https://www.pugetsystems.com/guides/1077
Dr Donald Kinghorn (Scientific Computing Advisor )

Intel Scalable Processors Xeon Skylake-SP (Purley) Buyers Guide

Written on December 2, 2017 by Dr Donald Kinghorn

Intel Purley platform, Skylake-SP, Xeon "Scalable" processors (Platinum, Gold, Silver, Bronze) are here. All 58 of them! How are you going to sort out that mess? Well, hopefully this post will help.

I'll try to sort these processors out by "use-case" to come up with a more manageable list of choices. I'll trim the list of processors and do a price performance analysis for different usage scenarios. ... What do I mean by "use-case"?

Use-case Performance by Program Execution Characteristics

Application software will have some kind of performance limiting behavior. This is generally determined by problem domain, algorithm choice, inherent parallel scalability, memory I/O demands, programmer skill, and choice of performance libraries and compiler tools. These performance determiners map more generally to compute hardware as follows:

Serial CPU (single thread)

Software with no parallelization is common: no multi-threading and no vectorization. This is typical of code written in scripting languages like Python, programs that are not performance critical, and legacy code that no one has bothered to optimize. [Note that scripting languages like Python are often used as front ends for code that is highly parallel, like numpy linked to the MKL library.]

  • For these types of programs the most important CPU attribute is likely to be the Max Turbo-Clock Frequency, or the Non-AVX All-Core-Turbo clock frequency if you are running many instances of these programs simultaneously [possibly typical "business" server applications like web apps, email, etc.].

Multi-Threaded Non-Vectorized

This large class of programs has a multi-threaded implementation but does not take advantage of vectorization. These are programs that may have "task parallelism" but do not make heavy use of matrix vector math operations. These programs may have limited scalability unless they are embarrassingly parallel.

  • For embarrassingly parallel programs that have long run-times, lots of cores are usually an advantage. Number of cores and Non-AVX All-Core-Turbo frequency are important.

  • Many programs in this class will have limited scalability. They may even be "hard-coded" to a limited number of threads. In this case fewer cores with a high Non-AVX All-Core-Turbo frequency would be a good choice.

Multi-Threaded and Vectorized

This is the ideal case for utilizing the features of these new processors. In this case high core-count and AVX-512 All-Core-Turbo (or AVX2 All-Core-Turbo for software that has not been updated) will likely determine performance. These are programs typical of HPC workloads like simulation and machine learning. Programs that make heavy use of matrix vector math operations should perform very well when utilizing AVX-512 and FMA operations. This of course assumes that the programmer has optimized their code! Programs that make library calls to Intel's MKL (Math Kernel Library) or DAAL (Data Analytics Acceleration Library) should give excellent performance on the new Xeon processors.

Note: when the AVX vector units are under load the CPU clock frequency can be significantly lowered! This is true for AVX2 and especially significant for AVX-512.

Memory I/O Bound

This is one of the most disturbing cases. I occasionally hear someone say "I got the latest, best, hardware and my program doesn't run any faster!" If the software is inherently I/O bound, or is poorly written with bad memory layout and lots of cache misses, it may be difficult to improve performance with "better" hardware. In some cases a large per-core cache may help. Also, keep in mind that the L3 Cache (Last Level Cache) is shared across all cores, so a high core-count CPU with a large total L3 Cache may help. If job run feasibility is limited by the amount of available memory then the Xeon processors that are marked by a trailing M will support up to 1.5TB memory each (I will not consider these processors in my analysis).

Who Cares about the CPU!

Let's face it, there are a lot of important modern programs, frameworks, and libraries that get their performance from GPU Acceleration. In particular, programs that utilize NVIDIA CUDA on Tesla or GeForce graphics cards can outperform their CPU based counterparts by an order of magnitude! Some of these programs will still need good CPU support for sections of the code that cannot be accelerated on the GPU. Those code sections will fall into one of the classes listed above. For programs that get nearly all of their performance from the GPU you just need enough CPU to support the GPUs.

Let's now start the process of weeding through this big pile of new processors.


Trimming the List of Processors (initial)

There are too many Intel Xeon Scalable processor product IDs.

Table of all 58 Processor Numbers

8180M  8180   8176M  8176F  8176   8170M  8170   8168   8164   8160T
8160M  8160F  8160   8158   8156   8153   6154   6152   6150   6148F
6148   6146   6144   6142M  6142F  6142   6140M  6140   6138T  6138F
6138   6136   6134M  6134   6132   6130T  6130F  6130   6128   6126T
6126F  6126   5122   5120T  5120   5119T  5118   5115   4116T  4116
4114T  4114   4112   4110   4109T  4108   3106   3104

There they are. Which one do you want?!

Intel has sorted these processors into "metal" classes: Platinum (81xx), Gold (61xx and 51xx), Silver (41xx), and Bronze (31xx).

Let's try to make this a bit more manageable. There are a lot of processors that can be eliminated from consideration (for our purposes).

To start, note that some of the processors have a letter at the end of their product ID number (M, F, T). These represent:

  • M -- Large memory capable, up to 1.5TB ("normal" is 768GB)

  • F -- Intel Omni-Path fabric (high speed network fabric)

  • T -- Thermal optimized ( for 10 year life cycle )

M we might occasionally be interested in, but F and T we can eliminate for sure. Let's also eliminate M since those processors are much more expensive than their "normal" versions. They would also have the same performance as the "normal" version unless you really needed the large memory capacity. We can also remove the 2 "Bronze" processors since they are (lame) 1.7GHz with no Turbo-Boost and no Hyper-Threading. That gets rid of 25 processors from the list.

There are 2 more "redundant" processors. 8156 and 8158 are the same as 5122 and 6136 except the 8156 and 8158 processors allow up to 8 socket systems as compared to a maximum of 4 sockets for the 5122 and 6136. Also, the 8156 and 8158 both cost $7007 while the 5122 is $1221 and the 6136 is $2460. I don't think you want to pay an extra $6000 per processor unless you really, really have to have an 8 socket system! That brings our elimination count up to 27 ... I think we can eliminate a few more ...

I'll drop the processors that have a Non-AVX All-Core-Turbo clock frequency less than 2.5GHz. They drop down to 1.6GHz or less for AVX-512 All-Core-Turbo. Those are 4108, 4110, 4116 and 8153. That gives us 31 to eliminate, leaving 27 to look at. That is still too many but we'll need to look at use-case price performance to find the best processors. Here are the processor IDs we will look at in more detail.

8180  8176  8170  8168  8164  8160  6154  6152  6150  6148
6146  6144  6142  6140  6138  6136  6134  6132  6130  6128
6126  5122  5120  5118  5115  4114  4112
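
The elimination steps above can be checked with a few lines of Python. This is only a sketch of the bookkeeping, not how the list was originally produced; the function name `trim` is made up for illustration, and the processor numbers are copied from the table of all 58.

```python
# All 58 processor numbers, copied from the table earlier in this post.
ALL_SKUS = """
8180M 8180 8176M 8176F 8176 8170M 8170 8168 8164 8160T
8160M 8160F 8160 8158 8156 8153 6154 6152 6150 6148F
6148 6146 6144 6142M 6142F 6142 6140M 6140 6138T 6138F
6138 6136 6134M 6134 6132 6130T 6130F 6130 6128 6126T
6126F 6126 5122 5120T 5120 5119T 5118 5115 4116T 4116
4114T 4114 4112 4110 4109T 4108 3106 3104
""".split()


def trim(skus):
    """Apply the elimination rules from the text, in the order given."""
    # 1. Drop the M (large memory), F (Omni-Path) and T (thermal) variants.
    kept = [s for s in skus if s[-1] not in "MFT"]
    # 2. Drop the two Bronze (31xx) processors.
    kept = [s for s in kept if not s.startswith("31")]
    # 3. Drop 8156 and 8158, the 8-socket twins of 5122 and 6136.
    kept = [s for s in kept if s not in {"8156", "8158"}]
    # 4. Drop the parts with Non-AVX All-Core-Turbo below 2.5GHz.
    kept = [s for s in kept if s not in {"4108", "4110", "4116", "8153"}]
    return kept


remaining = trim(ALL_SKUS)
print(len(ALL_SKUS), "processors in total,", len(remaining), "left to look at")
```

Steps 3 and 4 are plain lookups here because the clock and socket data are not in the list itself; with the full data table loaded they could be written as filters on the actual values.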

Data Analysis Price Performance and Use-Case

Performance measure

The first thing we need is a discriminatory metric to differentiate the processor performance. I'm going to use the following approximation to the theoretical performance,

Performance = Cores x TurboFreq x VecWidth x #FMA

Where,

  • Cores -- is the number of CPU cores (not considering Hyper-Threads)

  • TurboFreq -- is the relevant clock frequency (more on that below)

  • VecWidth -- is the AVX vector width for double precision floating point numbers ( It will be 4 for AVX2 and 8 for AVX-512 ).

  • #FMA -- This is the number of FMA (Fused Multiply-Add) AVX units. A few of the lower performance processors have 1 FMA unit and the rest have 2.

This would give a number that could be interpreted as GFLOP/s (Billions of floating point operations per second). This is not a very good estimate of peak floating point performance for these complicated processors but it will serve our needs as a performance measure. [We'll also consider cache size for memory bound application]
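
As a small worked example, here is the metric evaluated for a handful of processors. It is a minimal sketch: the `Cpu` class and `performance` function are names invented for this illustration, the clock used is the AVX-512 All-Core-Turbo, and the numbers are copied from the data table in this post.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    name: str
    cores: int
    turbo_ghz: float  # AVX-512 All-Core-Turbo, from the data table
    fma_units: int    # 1 or 2


def performance(cpu: Cpu, vec_width: int = 8) -> float:
    """Cores x TurboFreq x VecWidth x #FMA, the rough metric defined above.

    vec_width=8 is the AVX-512 width in doubles; pass 4 together with the
    AVX2 All-Core-Turbo clock to get the AVX2 variant of the number.
    """
    return cpu.cores * cpu.turbo_ghz * vec_width * cpu.fma_units


cpus = [
    Cpu("8168", 24, 2.5, 2),
    Cpu("6154", 18, 2.7, 2),
    Cpu("5122", 4, 3.3, 2),
    Cpu("4114", 10, 1.4, 1),
]

for cpu in cpus:
    print(f"{cpu.name}: {performance(cpu):7.1f}")
```

For the 8168 this works out to 24 x 2.5 x 8 x 2 = 960, and for the 4114 to 10 x 1.4 x 8 x 1 = 112. Treat these as relative scores for comparing processors rather than as measured GFLOP/s.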

CPU Clock Frequencies

There are 5 different CPU clocks for these new Xeon processors!

  • Base Clock -- This is the clock frequency that the processors would run at if all "Turbo-Boost" and "power management" was disabled in the system BIOS. This would be the frequency that would achieve the TDP power draw for the processor. It is basically useless information for most end users. It is not used for any performance estimation. However, it is typically used as part of the label for the CPU's.

  • Max Turbo Frequency -- This is the highest clock for the processor and is generally achieved when up to 2 processes are running on a many core processor, i.e. the other cores are idle. This number is important for non-parallel (serial) jobs that may be important in routine work. It's the clock that determines how "snappy" your system feels.

  • Non-AVX All-Core-Turbo -- This is the most important clock frequency for the CPU. This is the maximum clock that the CPU can run all of its cores at when the AVX vector units are not being utilized. Programs that are not optimized (or optimal) for matrix vector operations but that do have good multi-threaded scaling will likely be performance limited by this clock. This also applies to work-flows that require many simultaneous application programs.

  • AVX2 All-Core-Turbo -- Yes! When the AVX units are active the CPU core clocks decrease, and they decrease differently for AVX2 and the newer, twice the bit-width, AVX-512 vector units. The Xeon Scalable processors support SSE4.2, AVX, AVX2 and AVX-512 vector operations. These operations can have a huge impact on well optimized programs that make heavy use of matrix vector math. This is the performance that I personally judge processors by. It's the performance that is exposed by the (Intel optimized) Linpack benchmark. The vector units can give a performance boost of 4 to 16 fold, but it doesn't come for free. It requires a lot of power to run those parts of the CPU. There is a limited amount of power that can be safely run through a processor, and to stay within that limit the CPU has to clock down its core frequency in most cases.

  • AVX-512 All-Core-Turbo -- When these high performance vector units are loaded it is the worst case for the power draw on the processor. This vector unit can have a large throttling effect on the core clock frequency.
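
To put a number on the cost of that throttling, the sketch below compares the theoretical per-core vector throughput at the AVX2 and AVX-512 All-Core-Turbo clocks. The clocks are taken from the data table in this post; the helper name `avx512_gain` is made up for illustration, and the calculation assumes the same number of FMA units is active in both modes (so it applies to the 2-FMA processors only).

```python
# (AVX2 All-Core-Turbo GHz, AVX-512 All-Core-Turbo GHz) from the data table.
CLOCKS = {
    "8168": (3.0, 2.5),
    "6154": (3.3, 2.7),
    "6148": (2.6, 2.2),
}

AVX2_WIDTH = 4     # doubles per AVX2 vector
AVX512_WIDTH = 8   # doubles per AVX-512 vector


def avx512_gain(avx2_ghz: float, avx512_ghz: float) -> float:
    """Theoretical AVX-512 speedup over AVX2 after the clock drop."""
    return (avx512_ghz * AVX512_WIDTH) / (avx2_ghz * AVX2_WIDTH)


for name, (avx2, avx512) in CLOCKS.items():
    print(f"{name}: {avx512_gain(avx2, avx512):.2f}x over AVX2 (ideal would be 2.00x)")
```

With equal clocks the wider vectors would give exactly 2x. With the lower AVX-512 clock the 8168, for example, comes out closer to 1.67x, which is still a large gain but not a free one.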

Intel has worked hard to optimize the maximum performance they can get out of the design for these new Xeons. There is variability in the quality of the chips and they try to get the most performance they can. By doing this they end up with a lot of subtly (or not so subtly) different processors. They are trying to not waste silicon but they really have produced way too many processors in my opinion.

Let's look at some of the important numbers for these processors and then we'll look at price performance plots for different software characteristics.

Differentiating Processor Data

The following table has data for each of the processors we are considering without including data that is common to all of them.

Processor ID  Price  Cores  Base Clock  Max Turbo  All Core  AVX2  AVX512  #FMA  Cache  Cache per Core  Mem Clock  TDP
8180          10009  28     2.50        3.80       3.2       2.8   2.3     2     38.5   1.375           2666       205
8176          8719   28     2.10        3.80       2.8       2.4   1.9     2     38.5   1.375           2666       165
8170          7411   26     2.10        3.70       2.8       2.4   1.9     2     35.75  1.375           2666       165
8168          5890   24     2.70        3.70       3.4       3.0   2.5     2     33     1.375           2666       205
8164          6120   26     2.00        3.70       2.7       2.3   1.8     2     35.75  1.375           2666       150
8160          4708   24     2.10        3.70       2.8       2.5   2.0     2     33     1.375           2666       150
8153          3115   16     2.00        2.80       2.3       2.0   1.6     2     22     1.375           2666       125
6154          3543   18     3.00        3.70       3.7       3.3   2.7     2     24.75  1.375           2666       200
6152          3661   22     2.10        3.70       2.8       2.4   2.0     2     30.25  1.375           2666       140
6150          3358   18     2.70        3.70       3.4       3.0   2.5     2     24.75  1.375           2666       165
6148          3078   20     2.40        3.70       3.1       2.6   2.2     2     27.5   1.375           2666       150
6146          3286   12     3.20        4.20       3.9       3.3   2.7     2     24.75  2.0625          2666       165
6144          2925   8      3.50        4.20       4.1       3.5   2.8     2     24.75  3.094           2666       150
6142          2952   16     2.60        3.70       3.3       2.9   2.2     2     22     1.375           2666       150
6140          2451   18     2.30        3.70       3.0       2.6   2.1     2     24.75  1.375           2666       140
6138          2618   20     2.00        3.70       2.7       2.3   1.9     2     27.5   1.375           2666       125
6136          2460   12     3.00        3.70       3.6       3.3   2.7     2     24.75  2.0625          2666       150
6134          2220   8      3.20        3.70       3.7       3.4   2.7     2     24.75  3.094           2666       130
6132          2111   14     2.60        3.70       3.3       2.9   2.3     2     19.25  1.375           2666       140
6130          1900   16     2.10        3.70       2.8       2.4   1.9     2     22     1.375           2666       125
6128          1697   6      3.40        3.70       3.7       3.6   2.9     2     19.25  3.208           2666       115
6126          1776   12     2.60        3.70       3.3       2.9   2.3     2     19.25  1.604           2666       125
5122          1227   4      3.60        3.70       3.7       3.6   3.3     2     16.5   4.125           2666       105
5120          1561   14     2.20        3.20       2.6       2.2   1.6     1     19.25  1.375           2400       105
5118          1273   12     2.30        3.20       2.7       2.3   1.6     1     16.5   1.375           2400       105
5115          1221   10     2.40        3.20       2.8       2.4   1.6     1     13.75  1.375           2400       85
4114          704    10     2.20        3.00       2.5       2.2   1.4     1     13.75  1.375           2400       85
4112          483    4      2.60        3.00       2.9       2.6   1.4     1     8.25   2.0625          2400       85

Notes: The price is Intel's suggested price in USD. The CPU clocks are in GHz, the Memory clock is in MHz, the Cache sizes are in MB, and the TDP is in Watts.

I like looking at numbers, but I have several plots below to help find the best processor choices. A few processors in the table stand out to me because of their obviously interesting features.

  • The 4112 is the least expensive and could be a good choice when the CPU doesn't matter that much i.e. maybe it's just there to support 4 or 8 NVIDIA GPU's for compute.

  • The 5122 stands out as the processor with the highest AVX-512 All-Core-Turbo. It is just 4 cores but they are all running full speed. It also has the largest per core Cache. This could be a good processor for memory bound programs and/or programs that don't have good parallel scaling but do take advantage of AVX512 vectorization.

  • The 6144 and 6146 have a high Max-Turbo and All-Core-Turbo with large Cache per Core.

  • The 8168 stands out as a good value for high core count and it has good AVX-512 All-Core-Turbo too.

Let's see what the plots have to show us.


Price vs Relative Performance Plots

I'm going to start with the Price vs Relative Performance plots for the best case software. That's software that is highly parallel (Multi-threaded) and highly vectorized for AVX-512. This is the software that would expose the best performance of the processors.

Higher performance is to the right and higher price is toward the top in the plots

[Plot: AVX-512 price vs performance]

There are 11 processors that don't look very attractive in this plot and I have the advantage of knowing that they don't look very good in any of the other many plots that I made. I am going to remove these from the data set and redo this plot and use the reduced dataset for the remainder of the plots.

[Removing 8180, 8176, 8170, 8164, 8160, 6152, 6150, 8153, 5115, 5118, 5120]

The next plot is excluding these processors.

[Plot: AVX-512 price vs performance, reduced processor set]

This plot gives a clearer picture of the relative performance. The trend and relative placement of the processors differ only slightly for the AVX2 All-Core-Turbo case so we won't include that plot.
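
For readers who prefer a single number to a scatter plot, one possible summary is the AVX-512 performance score per 1000 USD. This is only a sketch and not the method used to make the plots: the function names are invented for illustration, the score is the Cores x TurboFreq x VecWidth x #FMA approximation from earlier in the post, and the rows are a sample copied from the data table.

```python
# (name, price USD, cores, AVX-512 All-Core-Turbo GHz, #FMA) from the data table.
ROWS = [
    ("8180", 10009, 28, 2.3, 2),
    ("8168", 5890, 24, 2.5, 2),
    ("6154", 3543, 18, 2.7, 2),
    ("6148", 3078, 20, 2.2, 2),
    ("6140", 2451, 18, 2.1, 2),
    ("6130", 1900, 16, 1.9, 2),
    ("4114", 704, 10, 1.4, 1),
]


def score(cores, ghz, fma, vec_width=8):
    """Approximate AVX-512 performance score used throughout this post."""
    return cores * ghz * vec_width * fma


def score_per_kilodollar(row):
    """Performance score bought per 1000 USD of list price."""
    _name, price, cores, ghz, fma = row
    return score(cores, ghz, fma) / (price / 1000.0)


ranked = sorted(ROWS, key=score_per_kilodollar, reverse=True)
for row in ranked:
    print(f"{row[0]}: {score_per_kilodollar(row):6.1f} per $1000")
```

A ratio like this favors the cheaper processors and hides absolute performance, so it complements the plot rather than replacing it: a processor can be a good "value" and still be too slow for the job.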

The next plot uses the All-Core-Turbo frequency scaled by the per-core Cache. This gives more weight to the processors that may do better for jobs that are memory bound.

[Plot: large per-core Cache price vs performance, reduced processor set]

This plot favors the high clock processors with large cache.

The last plot uses the All-Core-Turbo frequency. This has a similar distribution to the plot including AVX-512 but does shift a few processors around without making large changes in relative "value".

[Plot: All-Core-Turbo price vs performance, reduced processor set]

From these last 3 plots the following processors stand out as offering good performance and value for a variety of use-cases. I'll now try to break down that usage.

[8168, 6154, 6148, 6144, 6140, 6134, 6130, 6128, 6126, 5122, 4114, 4112]


My picks for best Xeon Scalable Skylake-SP processors (by Usage)

Processor ID  Price  Cores  Base Clock  Max Turbo  All Core  AVX2  AVX512  #FMA  Cache  Cache per Core  Mem Clock  TDP
8168          5890   24     2.70        3.70       3.4       3.0   2.5     2     33     1.375           2666       205
6154          3543   18     3.00        3.70       3.7       3.3   2.7     2     24.75  1.375           2666       200
6148          3078   20     2.40        3.70       3.1       2.6   2.2     2     27.5   1.375           2666       150
6144          2925   8      3.50        4.20       4.1       3.5   2.8     2     24.75  3.094           2666       150
6140          2451   18     2.30        3.70       3.0       2.6   2.1     2     24.75  1.375           2666       140
6134          2220   8      3.20        3.70       3.7       3.4   2.7     2     24.75  3.094           2666       130
6130          1900   16     2.10        3.70       2.8       2.4   1.9     2     22     1.375           2666       125
6128          1697   6      3.40        3.70       3.7       3.6   2.9     2     19.25  3.208           2666       115
6126          1776   12     2.60        3.70       3.3       2.9   2.3     2     19.25  1.604           2666       125
5122          1227   4      3.60        3.70       3.7       3.6   3.3     2     16.5   4.125           2666       105
4114          704    10     2.20        3.00       2.5       2.2   1.4     1     13.75  1.375           2400       85
4112          483    4      2.60        3.00       2.9       2.6   1.4     1     8.25   2.0625          2400       85

Notes: The price is Intel's suggested price in USD. The CPU clocks are in GHz, the Memory clock is in MHz, the Cache sizes are in MB, and the TDP is in Watts.

Here's a breakdown along the use-case ideas I presented near the top of the post.

Serial CPU (Single Thread)

  • 6144, 6128, 5122 These have 8, 6 and 4 cores with high Turbo frequencies and larger caches. They will all give excellent single thread performance and they maintain a respectable clock frequency if many jobs are run at the same time. They would also have good parallel performance for jobs that have limited scalability.

Multi-Threaded (Non-Vectorized)

  • 8168, 6154, 6148, 6140, 6130, 6126, 4114 These have 24, 18, 20, 18, 16, 12, and 10 cores. They have good All-Core-Turbo speeds. My favorite would be a dual 6154; it has great performance and value.

Multi-Threaded Vectorized (Highly Optimized Software)

  • 8168, 6154, 6148, 6140, 6130, 6126 These are mostly the same as above! These processors all have good AVX-512 All-Core-Turbo frequency and high core count.

Memory I/O Bound

  • 6144, 6134, 5122 or 8168, 6148, 6140 The first group has large Cache per Core, and the second has higher core count and larger overall shared L3 Cache.

CPU doesn't Matter

  • 6128, 5122, 4114, 4112 The first 2 would be excellent for supporting a system using GPU's for compute where there is still need for fast CPU processing and fast memory transport to and from the GPU's. The last 2 are simply the lowest cost processors on my list and would be good when CPU really doesn't matter much.

There you have it! Those are my picks and recommendation. You may have a different opinion! Hopefully the charts and tables give you something to consider if you have a use case that I didn't include.

One thing that I didn't discuss here is parallel scaling and Amdahl's Law. This is important! I recommend that you look at the post I did for the Broadwell Xeons. There is an interactive chart in that post that painfully shows how less-than-perfect scaling can affect the performance of high core count systems. Intel Xeon E5 v4 Broadwell Buyers Guide (Parallel Performance)
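
For a quick feel for the effect, Amdahl's Law can be sketched in a few lines of Python. The function name `amdahl_speedup` and the 95% parallel fraction are illustrative assumptions, not numbers from any measurement in this post.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n).

    p is the fraction of the run time that scales perfectly with cores,
    n is the number of cores. The serial part (1 - p) never gets faster.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical program that is 95% parallel, on the core counts in my picks.
for cores in (4, 8, 12, 18, 24):
    print(f"{cores:2d} cores: {amdahl_speedup(0.95, cores):5.2f}x")
```

Even with 95% of the work running in parallel, 24 cores give only about an 11x speedup over one core, which is why the lower core count, higher clock processors can be the better buy for software that does not scale well.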


Useful Links

  • The full list of Intel Xeon Scalable processors on Intel Ark. Pro tip: If you want a full spreadsheet with all of the Ark specifications, then click the check-box for "All" on the "compare tab". You can then click "Compare" and you will see an option for "Export comparison". That will give you an XML spreadsheet that you can load into something like Excel.

  • For technical details on the processors and errata see "Intel Xeon Processor Scalable Family Specification Update". This PDF document has loads of information including tables of all of the different clock frequencies for any number of active cores for all of the processors.

Happy computing! --dbk

Tags: Intel Xeon, Scalable Processors, Skylake-SP, Purley, Buyers Guide
James

I keep wondering when these processors will actually start being available to consumers. I've been holding off on buying a new workstation for about a year now, having heard these were going to be released. But even though Intel launched them five months ago, you can't really find them being sold anywhere. What's more, the ones that really do offer good price-performance value (e.g. 6154) are tray CPUs, which are always hard to come by if you want to build your own workstation and not go with something pre-built... /:

Posted on 2017-12-08 19:10:46
Janna

On that note, any estimate on when Puget will start to carry these CPUs on your workstations?

Posted on 2017-12-15 16:55:18
Donald Kinghorn

Sorry I missed your comments! I should get notifications now ...
We have started testing motherboards and processors but they haven't gotten to the point where I have any in my hands for testing. That will happen once we get some systems validated. I expect that to happen in a couple of weeks.

We don't have much demand for the Xeon Skylake-SP yet since the Skylake-X is such a fantastic processor (same basic core as the new Xeons... AVX512 is wonderful!). I would take a single socket 7980XE over a dual socket 2690v4 at this point. We will have to limit the number of the new multi-socket Xeon's that we will carry since there are so many of them. I am anxious to try a good dual socket setup. I would like to try Intel's new machine learning libraries.

I was at SC17 in November and I talked with a lot of people about the new hardware. Many people are waiting to see how AMD EPYC is going to perform before they make a new Xeon purchase. I'm curious about that too!

Thanks --Don

Posted on 2018-01-11 05:05:02
Dragon

So what I am getting out of all this is that both Intel and AMD pre-announced their new server processors by at least 6 months. BTW, a xeon W with anything more than 10 cores seems to still be a fictional beast. Maybe 140W was too optimistic since the 165W core x processors are essentially the same and readily available. Am I missing something?

Posted on 2018-01-19 02:33:19

I'm not sure if they're available for sale yet, but we do have samples here in-house of the 14-core Xeon W (model 2175).

Posted on 2018-01-19 05:08:31
Dragon

That sounds about right. They announced the 14 core later than the 10 and the 18, but maybe it comes out before the 18 because of a better chance of making the stated TDP. In any case, it is the wimpiest Xeon workstation launch in history. Also, no Mobos to be found for those processors, so the chipset must be in the same limbo. You can at least find a 32 core Epyc on Newegg.

Posted on 2018-01-19 05:34:22
Donald Kinghorn

Hey Dragon, a couple of things ... first I can tell you first hand that all of the Skylake-X and -W processors are really nice in general and particularly for anything that can take advantage of the AVX512 vector units. I have been using the 14-core 2175W that isn't released yet. Really like it! I think even at over $2K either the -X or -W 18 core are a good value (and so are the lower core count processors!) I'm sure Xeon-Scalable will be great performers but deciding what to use is a headache. (and in general they are overpriced)

You are right about the new processors launch! (and the motherboards, ...the Skylake-X boards were released with nasty bad BIOS screw-ups) The processors are great but the launch was pathetic! ... by both of them! SC17 had a strange vibe of dissatisfaction and excitement at the same time. ARM got an enormous amount of attention! Strange times... I think that part of the availability trouble is that first runs are going to the cloud providers and big users that are updating from large Xeon V3 and V4 deployments. It should get better soon ... but no-one will want to stock them because of the large number of SKU's and high price.

Posted on 2018-01-19 17:11:26
Dragon

It is a weird time. The focus is to fawn over Google and other cloud providers and to heck with anyone who just wants to buy or build a good high performance computer. The whole cloud concept is very fragile and potentially dangerously invasive to privacy and those issues are becoming more visible with time. If the market switches back, which I suspect it well might, neither Intel nor AMD has done much to win friends in the general business community.
One of the reasons for serving the big boys first, could well be that they want the performance badly enough to not worry about the actual TDP and that gives the process time to mature before general release. I do like the performance numbers on the Skylake-X and -W processors, but I think I will wait a little longer for the dust to settle and maybe the Volta GTX cards will be out by then.

Posted on 2018-01-19 18:24:31
Donald Kinghorn

*I HAD A MISTAKE IN THE "LARGE L3 CACHE" PLOT AND TABLES* **NOW FIXED**

I had the Cache-Per-Core value for the 6128 as 2.308 when it should have been 3.208
I rechecked ALL of those values and made a correction in my database csv file and re-ran the calculations and generated new plots.

That significantly moves the 6128 to the right in the plot i.e. higher performance. That moves it to the right of the 6126 12-core with 1/2 the cache/core

The 6128 looks like a really good processor for jobs that would benefit from large per core cache. The 5122 still leads in that regard with the largest per core cache of all of the processors.

We are finally getting systems with these processors into our systems but they are still somewhat hard to get! Craziest Xeon launch ever for Intel.

I should note that the "Purley" core is great for compute. It is certainly the best processor Intel has produced. This same basic core is available in Xeon Scalable [ Skylake-SP ], Xeon W [ Skylake-W ], and Core-i7/i9 [ Skylake-X ]. These are all wonderful processors! I wouldn't hesitate to recommend any of them, just go with your needs and budget.
--Don

Posted on 2018-02-26 18:15:25
Jeff

Great article! FYI, I linked to it from a recent blog I did on a different spin - comparing processors against generations.

https://www.linkedin.com/pu...

Posted on 2018-08-28 21:17:34
Donald Kinghorn

Thanks Jeff, Purley was the craziest CPU launch ever. Some great processors but trying to narrow down the way-too-many SKUs was a real challenge!

Posted on 2018-08-30 00:07:25
Nick Lam

Which of these CPUs support dual sockets?

Posted on 2018-10-24 08:03:27

I believe all of the Xeon Scalable processors support running in dual socket configurations, with some of the higher end models also supporting 4- and 8-socket setups.

Citation: https://ark.intel.com/produ... (even the lowest-end Bronze 3104 lists 2S support)

Posted on 2018-10-24 15:37:50