Intel Xeon E5 v4 Broadwell Buyers Guide (Parallel Performance)

This post updates, for the Xeon E5 v4 Broadwell processors, an older guide to the Xeon E5 v3 Haswell processors. The data has been refreshed to use the all-core-turbo CPU clock speeds and the new core counts for the Broadwell Xeon processors. Note that all-core-turbo is a much better performance parameter than CPU base-clock frequency, since it is the frequency the processors actually run at under full load!

For this update I have moved the discussion of Amdahl’s Law and parallel performance to the end of the post. If you don’t feel like reading then just scroll down to the bar chart and click some of the buttons and then realize what you are looking at :- )

An Intel CPU based workstation with proper power and cooling WILL run at all-core-turbo clock frequency under full load. So why doesn’t Intel just report all-core-turbo as the processor clock frequency? It looks like the base-clock is the frequency at which the stated TDP power usage is achieved. To get the processors to run at the base clock you have to disable “turbo” in the BIOS. It’s surprisingly difficult to find all-core-turbo frequency information. Most spec lists just report base and max-turbo clock frequencies, neither of which is very useful. I was able to get all-core-turbo for all of the v4 processors listed in this post except the E5 v4 1620, 1630 and 1650. For those processors I’ve kept the v3 information and will update if/when I can find the proper frequencies.

In the following chart 26 E5 v4 processors are listed in decreasing cost order (cost of two 26xx CPUs or one 16xx CPU). The bar length corresponds to the theoretical peak performance UNDER THE INFLUENCE OF AMDAHL’S LAW!

[Interactive bar chart: the buttons change the Amdahl’s Law parallel fraction used for performance scaling. Processors are ordered by price from high to low; 26xx prices are for 2 CPUs, 16xx prices are for 1 CPU.]

When I first looked at this chart I was shocked! It doesn’t tell the whole story though. There are other general usage considerations. Also note that some of the processors have a larger “smart cache” per core, and that can have a big influence on codes that are slowed down by cache misses. I have a table of the processors with some of their features listed at the bottom of this post.

Amdahl’s Law

The most important consideration when configuring a system for optimal parallel performance is the process scaling of your program. You need to have some idea of how many cores can be effectively utilized in order to make an informed decision about your system configuration. If you only want to run one job at a time using all cores on your system, then you need to know how many processes your program will scale to before parallel scaling degradation limits your performance gains. If you know that your code only scales well to 8 processes, you need to decide whether to configure an 8 core machine or a machine that will let you run several of these 8 core jobs at the same time (see “Other considerations”).

To get the idea of Amdahl’s Law consider this: if you have a single threaded program, and you can find a section of the code that uses 90% of the time, and you can make that part of the code run in parallel (perfectly), then even though that sounds good, your program will never be more than ten times faster, no matter how many cores you use! The remaining 10% still runs serially, and that alone caps the speedup at 1/0.1 = 10. [You might want to take a look at Matt’s article about estimating performance with Amdahl’s Law.]

If your code scaling is not great then you are likely better off with fewer cores running at higher clock frequencies. If your code scales really well then you will likely benefit from a higher core count.

The following chart shows the Amdahl’s Law curves up to 36 cores for 7 different parallel fractions, P, ranging from 1 to 0.95, i.e. from perfect linear scaling to 95% of execution time in parallel (maximum speedup = 20).

speedup = 1/( (1-P) + P/n )

where P is the parallel fraction and n is the number of processes (cores)

Notice how the speedup falls off with increasing core count. Just because your program runs almost 4 times faster with 4 cores does not mean it will run 36 times faster with a dual 18-core system.
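The fall-off is easy to check numerically. The following is a minimal sketch of the speedup formula above; the function names are my own, and P = 0.99 is just an illustrative value, not one taken from the chart.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores for a program with parallel fraction p (0 <= p <= 1)."""
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p):
    """Upper bound on the speedup as the core count grows without limit."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


if __name__ == "__main__":
    p = 0.99  # illustrative parallel fraction (an assumption, not from the chart)
    for n in (1, 4, 18, 36):
        print(f"P={p}, n={n:2d}: speedup = {amdahl_speedup(p, n):5.2f}")
    print(f"P=0.90 upper bound: {max_speedup(0.90):.1f}")
```

With P = 0.99 this gives a speedup of about 3.88 on 4 cores but only about 26.7 on 36 cores, well short of 36.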

To evaluate processor performance under the influence of Amdahl’s Law, observe that the speedup is the “effective” core count. If we calculate the theoretical performance of a system using this “effective core count” we get a much better picture of potential “real world” performance.

performance = "Effective core count" * All-core-turbo clock (GHz) * FLOPs per clock cycle (16 with AVX2 and FMA3)

For a dual E5-2699 v4 system with perfect parallel scaling, P = 1, that would be

performance = 44 * 2.8 * 16 = 1971.2 GFLOPS

Now at a parallel fraction of .95 Amdahl's Law gives us

effective number of cores = 1/( (1-.95) + .95/44 ) = 13.97

This gives a performance at P = .95 of

performance(P=.95) = 13.97 * 2.8 * 16 = 625.78 GFLOPS
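The same arithmetic can be written as a small function. This is only a sketch of the calculation above: the names are hypothetical, and the factor of 16 FLOPs per cycle is the value used in this post.

```python
FLOPS_PER_CYCLE = 16  # per-core factor used in this post (AVX2, FMA3)


def effective_cores(p, n):
    """Amdahl's Law speedup, read as an 'effective' core count."""
    return 1.0 / ((1.0 - p) + p / n)


def peak_gflops(cores, all_core_turbo_ghz, p):
    """Theoretical peak GFLOPS scaled by the effective core count."""
    return effective_cores(p, cores) * all_core_turbo_ghz * FLOPS_PER_CYCLE


if __name__ == "__main__":
    # Dual E5-2699 v4: 2 x 22 cores at 2.8 GHz all-core-turbo
    for p in (1.0, 0.95):
        print(f"P={p}: {peak_gflops(44, 2.8, p):.2f} GFLOPS")
```

Note that the 625.78 figure comes from the unrounded effective core count (about 13.968); multiplying the rounded 13.97 by hand gives a slightly different last digit.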

Other considerations

The three primary ways to utilize a multi-core system are:

  • Run single parallel jobs with all available cores
  • Take advantage of the increased core count to facilitate larger problem sizes
  • Run multiple, “single” or “few” process jobs

These three use cases are governed by the following:

  • Parallel performance characterized by Amdahl’s Law (we looked at this above)
  • Parallel performance according to Gustafson’s Law
  • Efficient job scheduling

Job Scheduling

You can treat a high core count workstation as a replacement for a small cluster. Set it up with a job scheduler, create a queue and load up your jobs. You may have some jobs that run single threaded code and some that really can’t take advantage of more than a couple of parallel processes. Let the scheduler balance the load. This can be a great way to get good utilization out of your system. In this case your choice of processors may be dictated more by your budget than anything else. If you have the resources you can go with a dual 18 or 14 core processor and get to work. Modern job schedulers are “parallel aware” so you can run mixed job types. Setting up a job scheduler is not always trivial but can certainly be worth the effort. Examples are SLURM, Grid Engine, Torque and PBS.

Gustafson’s Law

The next case that can be facilitated by a many-core workstation is running “larger” problems than you could with a less capable system. This is the realm of Gustafson’s Law.

The ideal case for Gustafson’s Law on a workstation is when having twice as many cores means you can run a job that is twice the size in the same amount of time. (You will likely need at least twice the memory too!) You have to be careful when considering this type of scaling (weak scaling) on a single node workstation, since larger problems can be limited by memory performance. On a cluster, distributing a larger parallel job over several nodes spreads it more evenly over cache and memory controllers. This can sometimes make a big difference in parallel performance and can occasionally result in “super linear” scaling because of the better memory utilization. On a single node many-core workstation you do get the extra cache associated with the cores, but the number of memory controllers is fixed.

On a many-core workstation you are more likely to be limited by the Amdahl’s Law performance of your code regardless of the problem size. However, if you are looking to increase the size of the problems that you look at, lots of cores and lots of memory are your friends!
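To put numbers on the difference between the two laws, here is a rough sketch. It assumes the usual statement of Gustafson’s Law, scaled speedup = (1-P) + P*n, where P is the parallel fraction of the run time measured on the n-core system (so it is not quite the same P as in Amdahl’s Law). The function names are my own.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_scaled_speedup(p, n):
    """Problem size grown with core count: scaled speedup on n cores.

    Assumes p is the parallel fraction of run time on the n-core system.
    """
    return (1.0 - p) + p * n


if __name__ == "__main__":
    p = 0.95
    for n in (8, 16, 44):
        print(f"n={n:2d}: Amdahl {amdahl_speedup(p, n):6.2f}   "
              f"Gustafson {gustafson_scaled_speedup(p, n):6.2f}")
```

At P = 0.95 on 44 cores the fixed-size (Amdahl) speedup is about 13.97 while the scaled (Gustafson) speedup is 41.85. That is the ideal case only; it ignores the memory limits discussed above.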

Intel Xeon E5 V4 CPU Specs

Processor     Base clock  All-core turbo  Cores  Smart cache  Cache per core  TDP   Price*
E5-2699 v4    2.2GHz      2.8GHz          22     55MB         2.5MB           145W  $4,115
E5-2698 v4    2.2GHz      2.7GHz          20     50MB         2.5MB           135W  $3,226
E5-2697A v4   2.6GHz      3.1GHz          16     40MB         2.5MB           145W  $2,891
E5-2697 v4    2.3GHz      2.8GHz          18     45MB         2.5MB           145W  $2,702
E5-2695 v4    2.1GHz      2.6GHz          18     45MB         2.5MB           120W  $2,424
E5-2687W v4   3.0GHz      3.2GHz          12     30MB         2.5MB           160W  $2,141
E5-2690 v4    2.6GHz      3.2GHz          14     35MB         2.5MB           135W  $2,090
E5-2667 v4    3.2GHz      3.5GHz          8      25MB         3.125MB         135W  $2,057
E5-2683 v4    2.1GHz      2.6GHz          16     40MB         2.5MB           120W  $1,846
E5-2658 v4    2.3GHz      2.5GHz          14     35MB         2.5MB           105W  $1,832
E5-2680 v4    2.4GHz      2.9GHz          14     35MB         2.5MB           120W  $1,745
E5-2643 v4    3.4GHz      3.6GHz          6      20MB         3.33MB          135W  $1,552
E5-2660 v4    2.0GHz      2.4GHz          14     35MB         2.5MB           105W  $1,445
E5-2650 v4    2.2GHz      2.5GHz          12     30MB         2.5MB           105W  $1,166
E5-2637 v4    3.5GHz      3.6GHz          4      15MB         3.75MB          135W  $996
E5-2640 v4    2.4GHz      2.6GHz          10     25MB         2.5MB           90W   $939
E5-1680 v4    3.4GHz      3.6GHz          8      20MB         2.5MB           140W  $1,723
E5-2630 v4    2.2GHz      2.4GHz          10     25MB         2.5MB           85W   $667
E5-1660 v4    3.2GHz      3.4GHz          8      20MB         2.5MB           140W  $1,080
E5-2623 v4    2.6GHz      2.8GHz          4      10MB         2.5MB           85W   $444
E5-2620 v4    2.1GHz      2.3GHz          8      20MB         2.5MB           85W   $417
E5-2609 v4    1.7GHz      1.7GHz          8      20MB         2.5MB           85W   $306
E5-1650 v4    3.6GHz      3.8GHz**        6      15MB         2.5MB           140W  $583
E5-2603 v4    1.7GHz      1.7GHz          6      15MB         2.5MB           85W   $213
E5-1630 v4    3.7GHz      3.8GHz**        4      10MB         2.5MB           140W  $372
E5-1620 v4    3.5GHz      3.6GHz**        4      10MB         2.5MB           140W  $294

* Price from Intel ARK

** All-Core-Turbo clock unknown at time of this post.
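If you want to play with the numbers yourself, the bar chart near the top of the post can be roughly reproduced from this table. The sketch below is not the code behind the chart; it just applies the performance formula from the Amdahl’s Law section to a handful of rows copied from the table, assuming two CPUs for 26xx parts and one for 16xx parts as in the chart. All names are hypothetical.

```python
FLOPS_PER_CYCLE = 16  # per-core factor used in this post (AVX2, FMA3)

# (processor, all-core-turbo GHz, cores per CPU, price per CPU in USD)
# A few rows copied from the table above; add more as needed.
CPUS = [
    ("E5-2699 v4", 2.8, 22, 4115),
    ("E5-2687W v4", 3.2, 12, 2141),
    ("E5-2667 v4", 3.5, 8, 2057),
    ("E5-2630 v4", 2.4, 10, 667),
    ("E5-1680 v4", 3.6, 8, 1723),
]


def sockets(name):
    """26xx parts are counted as dual socket, 16xx parts as single socket."""
    return 2 if name.startswith("E5-26") else 1


def system_gflops(name, ghz, cores, p):
    """Theoretical peak GFLOPS scaled by the Amdahl 'effective core count'."""
    n = cores * sockets(name)
    return ghz * FLOPS_PER_CYCLE / ((1.0 - p) + p / n)


def chart_rows(p):
    """Rows of (processor, system cost, GFLOPS), most expensive first."""
    rows = [(name, price * sockets(name), system_gflops(name, ghz, cores, p))
            for name, ghz, cores, price in CPUS]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for p in (1.0, 0.95):
        print(f"Parallel fraction P = {p}")
        for name, cost, gflops in chart_rows(p):
            bar = "#" * round(gflops / 50)  # one '#' per 50 GFLOPS
            print(f"  {name:<12} ${cost:>5}  {gflops:7.1f}  {bar}")
```

Changing the value of P in the loop plays the same role as the buttons on the chart.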

Happy computing! –dbk