https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/599
Dr Donald Kinghorn (Scientific Computing Advisor )

Intel Xeon E5 v3 Haswell-EP Buyers Guide

Written on October 3, 2014 by Dr Donald Kinghorn
Intel Xeon E5 V3 Haswell-EP Buyers Guide based on Parallel Performance Governed by Amdahl’s Law

(If you don’t feel like reading, just scroll down to the second chart, click some of the buttons, and then realize what you are looking at :-)

The new Xeon E5 v3 Haswell processors are here, all 30+ of them! There is a bewildering variety of clock speeds, core counts, and power usage. Processors in the new v3 family range from the single socket E5-1620v3 with 4 cores at 3.5 GHz to the dual socket E5-2699v3 with 18 cores at 2.3 GHz. How do you make a choice for a new system?!

Assuming you have some programs with a reasonable parallel implementation, either multi-threaded or using message passing (e.g. MPI), the factor that is likely to be most important is the application speedup with increasing number of cores, i.e. parallel scaling.

Amdahl’s Law

The most important consideration when configuring a system for optimal parallel performance is the process scaling of your program. You need to have some idea of how many cores can be effectively utilized in order to make an informed decision about your system configuration. If you only want to run one job at a time using all cores on your system, then you need to know how many processes your program will scale to before parallel scaling degradation limits your performance gains. If you know that your code only scales well to 8 processes, you need to decide whether to configure just an 8 core machine or a machine that will let you run several of these 8 core jobs at the same time.

To get the idea of Amdahl’s law, consider this: if you have a single threaded program in which one section of the code uses 90% of the run time, and you make that section run perfectly in parallel, your program will never be more than ten times faster, no matter how many cores you use! Parallelizing 90% sounds good, but the remaining 10% still runs serially, so the speedup is capped at 1/0.1 = 10.

If your code scaling is not great then you are likely better off with fewer cores running at higher clock frequencies. If your code scales really well then you will likely benefit from a higher core count.

The following chart shows the Amdahl’s Law curves up to 36 cores for 7 parallel fractions, P, ranging from 1 to 0.95, i.e. from perfect linear scaling to 95% of execution time in parallel (max speedup = 20).

speedup = 1/( (1-P) + P/n ) 

where P is the parallel fraction and n is the number of processes (cores) 

Notice how the speedup falls off with increasing core count. Just because your program runs almost 4 times faster with 4 cores does not mean it will run 36 times faster with a dual 18-core system.
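The formula above is easy to evaluate yourself. The following is a minimal sketch (the function name `amdahl_speedup` is just an illustrative choice) that prints the predicted speedup for a few parallel fractions and core counts, so you can see the fall-off without the chart.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - P) + P / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Tabulate the speedup for a few parallel fractions and core counts.
for p in (1.0, 0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {amdahl_speedup(p, n):5.2f}" for n in (4, 8, 18, 36))
    print(f"P={p:.2f} -> {row}")
```

With P = 0.99 the speedup on 4 cores is close to 4, but on 36 cores it is well short of 36, which is the effect described above.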

To evaluate the new E5 v3 processors under the influence of Amdahl’s Law, observe that the speedup is the “effective” core count as far as performance goes. To estimate the relative performance of the new processors we use the theoretical peak double precision floating point performance measured in GFLOPS.

performance = CPU cores * sockets * clock speed (GHz) * FLOPs per cycle per core (16 with AVX2 and FMA3)

For a dual E5-2699v3 system that would be

performance = 18 * 2 * 2.3 * 16 = 1324.8 GFLOPS

Now at a parallel fraction of .95, Amdahl’s law gives us:

effective number of cores = 1/( (1-.95) + .95/36 ) = 13.1

This gives a performance at P = .95 of

performance(P=.95) = 13.1 * 2.3 * 16 = 482 GFLOPS
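The arithmetic above can be reproduced with a few lines of code. This is only a sketch of the estimate used in this post; the function names are illustrative, and the factor of 16 FLOPs per cycle per core is the theoretical peak figure from the formula above, not a measured number.

```python
FLOPS_PER_CYCLE = 16  # double precision, AVX2 + FMA3, per core (theoretical peak)


def peak_gflops(cores_per_cpu: int, sockets: int, clock_ghz: float) -> float:
    """Theoretical peak: cores * sockets * clock (GHz) * FLOPs per cycle."""
    return cores_per_cpu * sockets * clock_ghz * FLOPS_PER_CYCLE


def amdahl_gflops(cores_per_cpu: int, sockets: int, clock_ghz: float,
                  parallel_fraction: float) -> float:
    """Peak performance with the core count replaced by the Amdahl speedup."""
    n = cores_per_cpu * sockets
    effective_cores = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)
    return effective_cores * clock_ghz * FLOPS_PER_CYCLE


# Dual E5-2699v3: 18 cores per CPU, 2 sockets, 2.3 GHz
print(f"peak:       {peak_gflops(18, 2, 2.3):7.1f} GFLOPS")
print(f"at P = .95: {amdahl_gflops(18, 2, 2.3, 0.95):7.1f} GFLOPS")
```

The second number comes out slightly below 482 because the code does not round the effective core count to 13.1 before multiplying.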

In the following chart, 27 of the new E5 v3 processors are listed in decreasing cost order (cost of two 26xx CPUs or one 16xx CPU). The bar length corresponds to the theoretical peak performance UNDER THE INFLUENCE OF AMDAHL’S LAW!

 

[Interactive chart: click a button to change the Amdahl’s Law parallel fraction used for performance scaling]

* Ordered by price from high to low. 26xx is the price for 2 CPUs; 16xx is the price for 1 CPU.


When I first looked at this chart I was shocked! It doesn’t tell the whole story though. There are other general usage considerations. Also note that some of the processors have a larger “smart cache” per core, and that can have a big influence on codes that are slowed down by cache misses. I have a table of the processors with some of their features listed at the bottom of this post.

Other considerations

The three primary ways to utilize a multi-core system are:

  • Run single parallel jobs with all available cores
  • Take advantage of the increased core count to facilitate larger problem sizes
  • Run multiple, “single” or “few” process jobs

These three use cases are governed, respectively, by the following:

  • Parallel performance characterized by Amdahl’s Law (we looked at this above)
  • Parallel performance according to Gustafson's Law
  • Efficient job scheduling


Job Scheduling

You can treat a high core count workstation as a replacement for a small cluster. Set it up with a good job scheduler, create a queue, and load up your jobs. You may have some jobs that run single threaded code and some that really can’t take advantage of more than a couple of parallel processes. This can be a great way to get full utilization out of your system, and in this case your choice of processors may be dictated more by your budget than anything else. If you have the resources you can go with dual 18 or 14 core processors and get to work. Modern job schedulers are “parallel aware”, so you can be running a mix of jobs. Setting up a job scheduler is not always trivial but can certainly be worth the effort. Examples are SLURM, Grid Engine, Torque, PBS, etc.
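To see why running several smaller jobs at once can beat one job on all cores, here is a rough sketch that compares throughput under Amdahl’s Law. It assumes the jobs are independent, the cores are split evenly, and nothing else (memory bandwidth, cache, I/O) gets in the way, which is optimistic; the function names are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - P) + P / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


def throughput(parallel_fraction: float, total_cores: int, concurrent_jobs: int) -> float:
    """Jobs finished per unit of single-core run time when the cores are
    split evenly between `concurrent_jobs` identical jobs.

    Assumption: no contention for memory bandwidth, cache, or I/O.
    """
    cores_per_job = total_cores // concurrent_jobs
    if cores_per_job < 1:
        raise ValueError("more concurrent jobs than cores")
    return concurrent_jobs * amdahl_speedup(parallel_fraction, cores_per_job)


# 36 cores, code with a parallel fraction of 0.95
for jobs in (1, 2, 4, 36):
    print(f"{jobs:2d} job(s) at a time: throughput = {throughput(0.95, 36, jobs):5.2f}")
```

Under these assumptions, four 9-core jobs running side by side get through roughly twice as much work as one 36-core job at a time, which is the case for a scheduler and a queue.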

Gustafson’s Law

The next case that can be facilitated by a many-core workstation is running larger problems than you could do with a less capable system. This is the realm of Gustafson’s Law.

The ideal case for Gustafson’s law on a workstation is when having twice as many cores means you can run a job that is twice the size in the same amount of time. (You will likely need at least twice the memory too!) You have to be careful when considering this type of scaling (weak scaling) on a single node workstation, since larger problems can be limited by memory performance. On a cluster, distributing a larger parallel job over several nodes spreads it more evenly over caches and memory controllers. This can sometimes make a big difference in parallel performance and can occasionally result in “super linear” scaling because of the better memory utilization. On a single node many-core workstation you do get the extra cache associated with the cores, but the number of memory controllers is fixed. We do have the advantage of DDR4 with the new Xeon E5 v3 processors, and this provides significantly more memory bandwidth.

On a many-core workstation you are more likely to be limited by the Amdahl’s Law performance of your code regardless of the problem size. However, if you are looking to increase the size of the problems that you look at, lots of cores and lots of memory are your friends!
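For comparison with the Amdahl numbers earlier, here is a small sketch of the usual statement of Gustafson’s Law, scaled speedup = (1 - P) + P*n. The formula is not given in this post, so treat it as my addition, and note that P here is the parallel fraction of the run time measured on the n-core system with the scaled-up problem, which is not quite the same P as in Amdahl’s Law.

```python
def gustafson_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Scaled speedup from Gustafson's Law: (1 - P) + P * n.

    parallel_fraction is the fraction of run time spent in parallel code
    on the n-core system, with the problem size grown to fill the machine.
    """
    return (1.0 - parallel_fraction) + parallel_fraction * n_cores


def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Fixed-size speedup from Amdahl's Law, for comparison."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)


for n in (8, 18, 36):
    print(f"n={n:2d}: Amdahl = {amdahl_speedup(0.95, n):5.2f}, "
          f"Gustafson = {gustafson_speedup(0.95, n):5.2f}")
```

The scaled speedup stays close to linear because the extra cores are given extra work. It says nothing about the memory limits discussed above, so on a single node it is best read as an upper bound.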

Intel Xeon E5 V3 CPU Specs

 
Processor   Clock   Cores  Smart Cache  Cache/Core  TDP   Price*
E5-2699 V3  2.3GHz  18     45MB         2.5MB       145W  $4,109
E5-2698 V3  2.3GHz  16     40MB         2.5MB       135W  $3,220
E5-2697 V3  2.6GHz  14     35MB         2.5MB       145W  $2,702
E5-2695 V3  2.3GHz  14     35MB         2.5MB       120W  $2,424
E5-2687 V3  3.1GHz  10     25MB         2.5MB       160W  $2,141
E5-2690 V3  2.6GHz  12     30MB         2.5MB       135W  $2,090
E5-2667 V3  3.2GHz  8      20MB         2.5MB       135W  $2,057
E5-2683 V3  2.0GHz  14     35MB         2.5MB       120W  $1,846
E5-2658 V3  2.2GHz  12     30MB         2.5MB       105W  $1,832
E5-2680 V3  2.5GHz  12     30MB         2.5MB       120W  $1,745
E5-2670 V3  2.3GHz  12     30MB         2.5MB       120W  $1,589
E5-2643 V3  3.4GHz  6      20MB         3.33MB      135W  $1,552
E5-2660 V3  2.6GHz  10     25MB         2.5MB       105W  $1,445
E5-2650 V3  2.3GHz  10     25MB         2.5MB       105W  $1,166
E5-2637 V3  3.5GHz  4      15MB         3.75MB      135W  $996
E5-2640 V3  2.6GHz  8      20MB         2.5MB       90W   $939
E5-1680 V3  3.2GHz  8      20MB         2.5MB       140W  $1,723
E5-2630 V3  2.4GHz  8      20MB         2.5MB       85W   $667
E5-1660 V3  3.0GHz  8      20MB         2.5MB       140W  $1,080
E5-2623 V3  3.0GHz  4      10MB         2.5MB       105W  $444
E5-2620 V3  2.4GHz  6      15MB         2.5MB       85W   $417
E5-2609 V3  1.9GHz  6      15MB         2.5MB       85W   $306
E5-1650 V3  3.5GHz  6      15MB         2.5MB       140W  $583
E5-2603 V3  1.6GHz  6      15MB         2.5MB       85W   $213
E5-1630 V3  3.7GHz  4      10MB         2.5MB       140W  $372
E5-1620 V3  3.5GHz  4      10MB         2.5MB       140W  $294

* Price from Intel ARK
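If you want to redo the comparison for your own parallel fraction, the table can be fed straight into the performance estimate from earlier in the post. The sketch below uses four dual socket parts copied from the table; the data layout and function names are illustrative, and the result is the same theoretical peak estimate as before, not a benchmark.

```python
# (name, clock in GHz, cores per CPU, sockets), copied from the table above
PROCESSORS = [
    ("E5-2699 V3", 2.3, 18, 2),
    ("E5-2697 V3", 2.6, 14, 2),
    ("E5-2687 V3", 3.1, 10, 2),
    ("E5-2667 V3", 3.2, 8, 2),
]

FLOPS_PER_CYCLE = 16  # double precision, AVX2 + FMA3, per core (theoretical peak)


def amdahl_gflops(clock_ghz, cores, sockets, parallel_fraction):
    """Theoretical peak GFLOPS with the core count replaced by the Amdahl speedup."""
    n = cores * sockets
    effective_cores = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)
    return effective_cores * clock_ghz * FLOPS_PER_CYCLE


def rank_by_amdahl(parallel_fraction):
    """Return (name, GFLOPS) pairs sorted from fastest to slowest."""
    scored = [
        (name, amdahl_gflops(clock, cores, sockets, parallel_fraction))
        for name, clock, cores, sockets in PROCESSORS
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for p in (1.0, 0.95):
    print(f"P = {p}")
    for name, gflops in rank_by_amdahl(p):
        print(f"  {name}: {gflops:7.1f} GFLOPS")
```

With perfect scaling the 18 core part comes out on top, but at P = .95 the 10 core 3.1 GHz part edges ahead of it in this estimate, at roughly half the price.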

Happy computing! --dbk

Tags: Xeon E5 v3, Haswell-EP, Amdahl's law, buyers guide
Frank Davidson

Very nice. Thanks!

Posted on 2014-10-08 21:28:25
Ramos

Donald, great article as usual! I wanted to ask a question that came up, that I cannot find a definitive answer for anywhere (from credible sources), even if I compare them on.. http://ark.intel.com/compar...
..and try to figure out what X got that Y doesn't:

E7v3* vs E5v3?

- When to go with what? (thinking dual systems only, or using E7-48xx's as they are cheaper for smaller systems)

- What kind of computational work would be best suited for going E7(v3)'s and in what professional situations is it best to just say, E5(v3) is good enough.

(Thinking HPC, Big Data, DB, in-memory analysis but not trusting Intel propaganda videos blindly)

I think this would make a great in-depth article btw ;)

I can see the little extra cache and assuming the DDR4, AVX2 and so on for E7v3 and E5v3 will be the same, then the sheer speed from CPU to memory seems to be the key difference,
68GB/s (for a v2 using DDR3 even) > E5v3 at 59GB/s
and
3xQPIs at 7.2GT/s (again for the E7v2) > 2xQPI at 8GT/s.
Can I conclude that E7's only are for very intense in-memory stuff such as SAP Hana and Apache Spark or MongoDB(also in memory if room) etc?

How will an E7v3 scenario change with Knights Landing and its superfast memory?...and will it change the scenario or obsoletify the E7 series?....or will the E7 still rule from KL's upper memory limit up to the 1TB RAM limit for E7 boards? (well SuperMicro has one with 6TB via extender cards but still)
(tia)
*: (when they come out very shortly)
http://www.cpu-world.com/ne...

Posted on 2015-02-20 00:17:05
Donald Kinghorn

Thanks Ramos, this post was fun to write, and seeing how the chart changes with scaling fraction was a real eye opener!

I really haven't had any time on an E7 system. We have looked at them and quoted some, but they are very expensive and fit in a usage class that I don't run into very often. The most interesting characteristic is the big memory support. We've had some engineering application users with really large model spaces who have wanted them for the memory space. I also think you are correct about use cases for large in-memory analysis applications. That's really where they make the most sense. It's hard to find build components other than the Supermicro base that you've seen. They don't have that much appeal to me personally.

The processor that I'm most excited about really is the 46xx quad socket E5 v3. We still haven't gotten word on when they will be available, but I hope to have samples soon. I really like the Ivy Bridge quad systems that we have been doing. I've been impressed with the memory performance and how well many codes scale on them. It makes a nice alternative to a small cluster in my opinion. It will be nice to have the beefier AVX2 vector units of the v3.

Best wishes --Don

Posted on 2015-02-20 00:45:41
Ramos

Wow, thanks for a very quick answer!

I agree with the E7v3s, sadly, I wanted to think I'd need em, I think. But the E5v3's should definitively fit my need (soon).

The E7v3's will just as the current v2's have a focus on the mission critical uptime stuff, rather than brute performance,

http://bladesmadesimple.com...

The E5's of the same gen/clock/cores, are even faster at the HPC pack benchmarks it seems, however one can swap anything while it runs with an E7, even CPU slots and memory. And then it basically runs some kinda RAID1 or even RAID5 on memory and even some CPU stuff. Very advanced.

--------------

If you don't mind I had some other questions that came up:

Have you or your colleagues had any experience with user scenarios other than HPC with these E5v3's?

- Spark outside Hadoop, in for example MemSQL, which I find very fascinating at the moment for complete real-time-in-mem analytics on streaming data for example.

- Storm on Hadoop, the streaming processing plugin that is lauded to be extremely fast.

- Word2Vec from Google. Text similarity machine learning.
- Virtualization for private clouds and being master for these nodes(in Hadoop its again stored in memory till backuped to disks), how well do they scale to fill a role and such.

Do you think the HDFS via YARN will be the future of HPC (implementing jobs with tasks in something faster and more mathematically precise than Java) and leave true Intel/Fortran HPC to the supercomputers or will there still be an unchanged market for HPC?

--------------

Can we expect an article about Knight's Landing soon?

I am very curious about the whole new concept after reading about the complete self-hosting feature and how they can host 384GB RAM each, with 16GB ultra fast, basically Cache3 MCDRAM at 500GB/s.

- Do you feel this can make the KL's replace huge memory focused servers (ie with some but not a lot of CPU cores, but over 512GB RAM) for in-memory analytics? (Thinking a 3xKL server will already have 900GB ram and will likely be cheaper and cheaper in Watts as well for HPC...AND you get the 764 threads for free on top, even if you don't immediately need them)

- Do you have a feeling of how important the data locality will be now, that the individual nodes can speed up computations in some cases?

- I had a crazy thought: You think there will be a Hadoop+YARN Ecosystem clone for pure in-memory work on a KL cluster, but implemented in Intel C or most efficient language there is?
Having each KL as a Node and using a CPU as a master or even using each of the KL's as NameNode's and RessourceMasters for all the other DataNodes and then distributing jobs and local data for the computations like that?
(It sounds brilliant to re-use the optimized design for KL clusters in my mind but I may have forgotten something)

Posted on 2015-02-21 20:02:28
Donald Kinghorn

Hey Ramos,
Great (detailed) questions! Unfortunately I don't have great detailed answers :-)

I'm personally most involved with "standard" scientific computing. I haven't spent a lot of time on storage and data analytics. However, I am very interested and feeling more motivated because there are so many interesting problems to solve from both a sys admin and scientific computing viewpoint. I did some of what would now be called "machine learning" years ago and have wanted to revisit it. High speed data streaming and analysis on GPU is what I'm thinking about ...

I don't have any more info on Knight's Landing than you do at this point. However, when we do get something from Intel I will surely write about it!

Thanks! --Don

Posted on 2015-02-24 18:46:41
Ashu

i bought z10pe d16 ws motherboard and i was checking compatible CPUs but i am not seeing the below CPU in the above list please advise
Intel Xeon Processor E5-2690 v3 CPU 2.4GHz 12-Core 135W 30M Max 3.0GHz QEYJ ES

Posted on 2016-06-06 04:38:38