Intel Xeon E5 V3 Haswell-EP Buyers Guide based on Parallel Performance Governed by Amdahls Law
( If you don’t feel like reading then just scroll down to the second chart and click some of the buttons and then realize what you are looking at :- )
The new Xeon E5 v3 Haswell processors are here, all 30+ of them! There is a bewildering variety of clock speeds, core counts, and power usage. The processors in the new v3 family range from the single socket E5-1620v3 with 4 cores at 3.5 GHz to the dual socket E5-2699v3 with 18 cores at 2.3 GHz. How do you make a choice for a new system?!
Assuming you have some programs with a reasonable parallel implementation, multi-threaded or with message passing (e.g. MPI), the factor that is likely to be most important is the application speedup with increasing number of cores, i.e. parallel scaling.
The most important consideration when configuring a system for optimal parallel performance is the process scaling of your program. You need to have some idea of how many cores can be effectively utilized in order to make an informed decision about your system configuration. If you only want to run one job at a time using all cores on your system, then you need to know how many processes your program will scale to before parallel scaling degradation limits your performance gains. If you know that your code only scales well to 8 processes, you need to decide whether to configure just an 8 core machine or a machine that will let you run several of these 8 core jobs at the same time.
To get the idea of Amdahl’s law consider this: Suppose you have a single threaded program and you find a section of the code that uses 90% of the run time, and you make that part of the code run perfectly in parallel. That sounds good, but the remaining 10% still runs serially, so your program will never be more than ten times faster, no matter how many cores you use!
If your code scaling is not great then you are likely better off with fewer cores running at higher clock frequencies. If your code scales really well then you will likely benefit from a higher core count.
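To make that trade-off concrete, here is a rough sketch that compares the two extremes quoted at the top of this post using Amdahl’s Law, speedup = 1/( (1-P) + P/n ). The "effective cores times clock speed" figure of merit is an assumption made for illustration only; it ignores turbo boost, cache size, and memory bandwidth, and the function names are made up for this example.

```python
def amdahl_speedup(p, n):
    """Amdahl's Law speedup on n cores for parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)


def relative_throughput(p, cores, clock_ghz):
    """Crude proxy (an assumption): effective core count times base clock."""
    return amdahl_speedup(p, cores) * clock_ghz


# Core counts and base clocks quoted earlier in the post
few_fast = (4, 3.5)    # single E5-1620v3
many_slow = (36, 2.3)  # dual E5-2699v3

for p in (0.5, 0.7, 0.8, 0.9, 0.99):
    a = relative_throughput(p, *few_fast)
    b = relative_throughput(p, *many_slow)
    winner = "4 cores @ 3.5 GHz" if a > b else "36 cores @ 2.3 GHz"
    print(f"P={p:.2f}  4 cores: {a:6.2f}  36 cores: {b:6.2f}  -> {winner}")
```

Under this simple model the crossover falls somewhere between P = 0.7 and P = 0.8. Below that, the four fast cores come out ahead; above it, the 36 slower cores do.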
The following chart shows the Amdahl’s Law curves up to 36 cores for 7 parallel fractions, P, ranging from 1 to 0.95, i.e. from perfect linear scaling to 95% of execution time in parallel (max speedup = 20).
speedup = 1/( (1-P) + P/n ) where P is the parallel fraction and n is the number of processes (cores)
Notice how the speedup falls off with increasing core count. Just because your program runs almost 4 times faster with 4 cores does not mean it will run 36 times faster with a dual 18-core system.
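The formula above is easy to play with yourself. The following is a minimal sketch; the function names and the three sample values of P are chosen for illustration and are not necessarily the exact fractions used in the chart.

```python
def amdahl_speedup(p, n):
    """speedup = 1/((1-P) + P/n) for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


def max_speedup(p):
    """Limit of the speedup as the core count goes to infinity."""
    return float("inf") if p == 1.0 else 1.0 / (1.0 - p)


for p in (1.0, 0.99, 0.95):
    row = "  ".join(f"n={n}: {amdahl_speedup(p, n):5.2f}" for n in (4, 18, 36))
    print(f"P={p:.2f}  {row}  limit: {max_speedup(p):.0f}")
```

At P = 0.95 this gives a speedup of roughly 3.5 on 4 cores but only about 13.1 on 36 cores, which is the fall-off described above.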
To evaluate the new E5 v3 processors under the influence of Amdahl’s Law, observe that the speedup is the “effective” core count as far as performance goes. To estimate the relative performance of the new processors we use the theoretical peak double precision floating point performance measured in GFLOPS.
performance = CPU cores * sockets * Clock speed (GHz) * AVX2 vector length and FMA3 (16)
For a dual E5-2699v3 system that would be
performance = 18 * 2 * 2.3 * 16 = 1324.8 GFLOPS
Now at a parallel fraction of .95 Amdahl’s law gives us
effective number of cores = 1/( (1-.95) + .95/36 ) = 13.1
This gives a performance at P = .95 of
performance(P=.95) = 13.1 * 2.3 * 16 = 482 GFLOPS
In the following chart 27 of the new E5 v3 processors are listed in decreasing cost order (cost of two 26xx CPUs or one 16xx CPU). The bar length corresponds to the theoretical peak performance UNDER THE INFLUENCE OF AMDAHL’S LAW!
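The arithmetic for the dual E5-2699v3 example can be reproduced with a few lines of Python. This is only a sketch of the calculation described here; the function names are made up for illustration, and the factor of 16 is the AVX2 and FMA3 figure used in this post.

```python
FLOPS_PER_CYCLE = 16  # AVX2 vector length and FMA3 factor used in this post


def peak_gflops(cores, sockets, clock_ghz):
    """Theoretical peak double precision performance in GFLOPS."""
    return cores * sockets * clock_ghz * FLOPS_PER_CYCLE


def amdahl_gflops(cores, sockets, clock_ghz, p):
    """Peak performance with the core count replaced by the Amdahl speedup."""
    n = cores * sockets
    effective_cores = 1.0 / ((1.0 - p) + p / n)
    return effective_cores * clock_ghz * FLOPS_PER_CYCLE


# Dual E5-2699v3: 18 cores per socket, 2 sockets, 2.3 GHz
print(f"peak:        {peak_gflops(18, 2, 2.3):.1f} GFLOPS")
print(f"at P = 0.95: {amdahl_gflops(18, 2, 2.3, 0.95):.1f} GFLOPS")
```

Without rounding the effective core count to 13.1 first, the second figure comes out at about 481.7 GFLOPS, which agrees with the 482 GFLOPS quoted above.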
click button to change Amdahl's Law parallel fraction for performance scaling
* Ordered by price from high to low. The 26xx price is for 2 CPUs; the 16xx price is for 1 CPU.
When I first looked at this chart I was shocked! It doesn’t tell the whole story though. There are other general usage considerations. Also note that some of the processors have a larger “smart cache” per core and that can have a big influence on codes that are slowed down by cache misses etc. I have a table of the processors with some of their features listed at the bottom of this post.
The three primary ways to utilize a multi-core system are:
- Run single parallel jobs with all available cores
- Take advantage of the increased core count to facilitate larger problem sizes
- Run multiple, “single” or “few” process jobs
These three use cases are governed by the following:
- Parallel performance characterized by Amdahl’s Law (we looked at this above)
- Parallel performance according to Gustafson's Law
- Efficient job scheduling
You can treat a high core count workstation as a replacement for a small cluster. Set it up with a good job scheduler, create a queue and load up your jobs. You may have some job runs with single threaded code and some jobs that really can’t take advantage of more than a couple of parallel processes. This can be a great way to get full utilization out of your system, and in this case your choice of processors may be dictated more by your budget than anything else. If you have the resources you can go with dual 18 or 14 core processors and get to work. Modern job schedulers are “parallel aware” so you can be running a mix of jobs. Setting up a job scheduler is not always trivial but can certainly be worth the effort. Examples are SLURM, Grid Engine, Torque, PBS etc.
The next case that can be facilitated by a many-core workstation is running larger problems than you could do with a less capable system. This is the realm of Gustafson’s Law.
The ideal case for Gustafson’s law on a workstation is when having twice as many cores means you can run a job that is twice the size in the same amount of time. (You will likely need at least twice the memory too!) You have to be careful when considering this type of scaling (weak scaling) on a single node workstation since larger problems can be limited by memory performance. On a cluster, distributing a larger parallel job over several nodes gives a more even distribution over cache and memory controllers. This can sometimes make a big difference in parallel performance and can occasionally result in “super linear” scaling because of the better memory utilization. On a single node many-core workstation you do get the extra cache associated with the cores but the number of memory controllers is fixed. We do have the advantage of DDR4 with the new Xeon E5 v3 processors and this provides significantly more memory bandwidth.
On a many-core workstation you are more likely to be limited by the Amdahl’s Law performance of your code regardless of the problem size. However, if you are looking to increase the size of the problems that you look at, lots of cores and lots of memory are your friends!
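For comparison with the Amdahl numbers, here is a small sketch of Gustafson’s Law in its usual textbook form, scaled speedup = (1-P) + P*n. The formula is not spelled out in this post, so treat it as an assumption brought in for illustration. Also note that P here is the parallel fraction of the run time measured on the n-core system with the scaled-up problem, which is not quite the same quantity as the P in Amdahl’s Law.

```python
def gustafson_speedup(p, n):
    """Scaled (weak scaling) speedup: the problem size grows with n cores."""
    return (1.0 - p) + p * n


def amdahl_speedup(p, n):
    """Fixed problem size (strong scaling) speedup on n cores."""
    return 1.0 / ((1.0 - p) + p / n)


p = 0.95
for n in (4, 18, 36):
    print(f"n={n:2d}  Amdahl: {amdahl_speedup(p, n):5.2f}  "
          f"Gustafson: {gustafson_speedup(p, n):5.2f}")
```

With P = 0.95 on 36 cores the scaled speedup is 34.25 compared to about 13.1 for the fixed-size problem, which is why growing the problem with the core count looks so much better on paper. It does not account for the memory bandwidth limits mentioned above.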
Intel Xeon E5 V3 CPU Specs
* Price from Intel ARK
Happy computing! –dbk