Haswell Floating Point Performance
Written on August 26, 2013 by Dr. Donald Kinghorn
I was doing some system testing on a nicely configured dual Intel Xeon node using the optimized Linpack benchmark [1] from the Intel MKL [2] (Math Kernel Library) and was really impressed with the performance numbers I was seeing. I had a couple of new Haswell systems set up, so I decided to run the same benchmark code on those too. I was stunned by the performance of the Haswell processors! I tested a few other systems and found that Haswell gives about a 70% speedup over Ivy Bridge at the same core clock speed. I was surprised by that, because what I had been hearing about Haswell was that it was generally only about 10% (or less) faster than Ivy Bridge. It looks like the new AVX2 [3] instructions on Haswell are really effective.
On Intel hardware the numerical routines in MKL are highly optimized to take full advantage of every bit of performance you can get out of the hardware, and most of the algorithms are SMP parallel using OpenMP [4] threads. Running the Linpack benchmark included with MKL will give you a good idea of what your system is really capable of as far as numerical computing goes. It also gives you good motivation for making use of those library routines in your own code when you can, since a lot of the heavy performance optimization is already done for you.
Table 1 -- Linpack benchmark using the Intel MKL library
| System | Processor | Cores, clock, instruction set | Linpack |
|---|---|---|---|
| 1 | Dual Xeon E5 2687 | 16 cores @ 3.2GHz AVX | 345 GFLOPS |
| 2 | Dual Xeon E5 2650 | 16 cores @ 2.0GHz AVX | 262 GFLOPS |
| 3 | Core i7-4770K (Haswell) | 4 cores @ 3.5GHz AVX2 | 177 GFLOPS |
| 4 | Xeon E3 1245v3 (Haswell) | 4 cores @ 3.4GHz AVX2 | 170 GFLOPS |
| 5 | Core i5 3570 (Ivy Bridge) | 4 cores @ 3.4GHz AVX | 105 GFLOPS |
| 6 | Core i7 920 | 4 cores @ 2.66GHz SSE4.2 | 40 GFLOPS |
If you are the type of person who knows what GFLOPS [5] means, and cares about it, then look at those numbers again and realize what you are seeing. Your heart rate should be a bit higher now. OK, OK, I know there is a big difference between a dual E5 Xeon compute node and a desktop Core i7 machine but, hey, you can run some serious compute jobs on a simple Core i7 desktop system with 32GB of RAM and a nifty SSD!
If you suffer from "number blindness" and prefer a chart, here you go;
Notes for Table 1:
- The number of cores refers to the actual real cores in the processor. "Hyperthreading" was enabled on those systems during the benchmark runs so the systems would report twice as many "cores" as are listed here. In table 2 we will see that for heavy numerical computation Hyperthreading is less than useless. Also note, the Core i5 processor doesn't have Hyperthreading.
- Systems 1,2,3,4 were running CentOS 6.4 Linux. System 5 was running Ubuntu 13.04 and system 6 was running Fedora 18. The underlying Linux system for this benchmark isn't really going to make any significant difference.
- AVX2 is probably the real star of the show here, since it is in essence a 256-bit vector processing unit in each core, i.e. a single instruction operates on multiple floating point values at once. Haswell has the latest iteration of this technology, which is in general a newer replacement for what SSE instructions were used for in the past.
- For most of the results, the matrix A had a leading dimension of 25000, which used approximately 5GB of memory. I could have used a larger problem size, but I just happened to be using that size for testing some Xeon Phi cards too :-) and it's a large enough problem to give a good representation of the system's performance. It takes about 1 minute on a Core i7 4770. The problem size for the job run on system 1 was actually 45000. I used that in the table since it was the only number I recorded from the testing runs I did on that machine.
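The memory and run time figures in the last note can be sanity checked with a little arithmetic. The sketch below assumes double precision (8 bytes per element) and the usual Linpack operation count of 2/3·n³ + 2·n²; the function names are made up for illustration and are not part of MKL or the benchmark.

```python
# Rough sanity check of the problem-size note above.
# ASSUMPTIONS: 8-byte double precision elements, and the conventional
# Linpack operation count 2/3*n^3 + 2*n^2. Function names are illustrative.

def matrix_bytes(n, bytes_per_element=8):
    """Memory needed to hold a dense n x n matrix."""
    return n * n * bytes_per_element


def linpack_flops(n):
    """Nominal floating point operation count for solving an n x n system."""
    return (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2


def estimated_seconds(n, gflops):
    """Run time implied by a sustained rate given in GFLOPS."""
    return linpack_flops(n) / (gflops * 1e9)


n = 25000
print(f"matrix size: {matrix_bytes(n) / 1e9:.1f} GB")
print(f"operations:  {linpack_flops(n):.3e}")
print(f"time at 177 GFLOPS: {estimated_seconds(n, 177.0):.0f} s")
```

With n = 25000 this gives a 5 GB matrix and a run time just under a minute at the 177 GFLOPS reported for system 3, which agrees with the figures quoted in the note.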
The really impressive feature of these performance numbers is the implied performance per core. Just dividing the GFLOPS by the number of cores in a system is NOT a good way to judge per core performance, but it does convey the pulse quickening information that I referred to earlier. For the actual thread scaling performance see table 2 below.
I was so impressed with the performance of the Core i7 4770 that I bought one for myself! I built a new home system to replace my aging (vintage 2008) Core i7 920 system (system number 6 in table 1).
To round out these numbers, and confuse you more (these are benchmarks after all!), I've included another table showing the parallel scaling on the Xeon E5 2650 system and my new system with a "non-K" Core i7 4770 (I wanted VT-d capability, which the "K" overclocker processors don't have).
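For anyone who wants to see the implied per core numbers anyway, here is a small sketch that divides the Table 1 results by core count and by nominal clock. The peak FLOPs per cycle values are my own assumption about these processor generations (they are not from the benchmark runs), and the nominal clock ignores Turbo Boost, so treat the output as a rough guide only.

```python
# Rows copied from Table 1:
# (system, processor, real cores, nominal clock in GHz, instruction set, GFLOPS)
TABLE_1 = [
    (1, "Dual Xeon E5 2687", 16, 3.2, "AVX", 345.0),
    (2, "Dual Xeon E5 2650", 16, 2.0, "AVX", 262.0),
    (3, "Core i7-4770K (Haswell)", 4, 3.5, "AVX2", 177.0),
    (4, "Xeon E3 1245v3 (Haswell)", 4, 3.4, "AVX2", 170.0),
    (5, "Core i5 3570 (Ivy Bridge)", 4, 3.4, "AVX", 105.0),
    (6, "Core i7 920", 4, 2.66, "SSE4.2", 40.0),
]

# ASSUMPTION (not from the article): theoretical peak double precision
# FLOPs per clock cycle per core for each generation in the table.
PEAK_FLOPS_PER_CYCLE = {"AVX2": 16, "AVX": 8, "SSE4.2": 4}


def gflops_per_core(gflops, cores):
    """Naive per core figure: total GFLOPS divided by real cores."""
    return gflops / cores


def flops_per_cycle(gflops, cores, ghz):
    """Achieved FLOPs per clock cycle per core, at the nominal clock."""
    return gflops / (cores * ghz)


for system, name, cores, ghz, isa, gflops in TABLE_1:
    achieved = flops_per_cycle(gflops, cores, ghz)
    print(f"{system}  {name:26s} {gflops_per_core(gflops, cores):5.1f} GFLOPS/core  "
          f"{achieved:5.2f} of {PEAK_FLOPS_PER_CYCLE[isa]} FLOPs/cycle/core")

# Systems 4 and 5 both have 4 cores at a nominal 3.4GHz, so their ratio is a
# same-clock comparison of Haswell against Ivy Bridge.
haswell_vs_ivy = TABLE_1[3][5] / TABLE_1[4][5]
print(f"Haswell / Ivy Bridge at 3.4GHz: {haswell_vs_ivy:.2f}x")
```

Because the division uses the nominal clock, a system that spends the run above its rated speed can appear to exceed the assumed peak, which is one more reason not to read too much into per core figures like these.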
Table 2 -- OpenMP thread scaling
Dual Xeon E5-2650, 16 real cores (GFLOPS)
Core i7-4770, 4 real cores (GFLOPS)
Notes for Table 2:
- You can see what I mean by "less than useless" when referring to Hyperthreading for compute problems like Linpack. When it was first introduced it caused significant slowdowns for numerical computation, and it was common practice to disable it in the BIOS. With the modern implementation it doesn't seem to cause much trouble, so I leave it enabled in case something actually benefits from it.
- Something you should keep in mind about modern Intel processors is that they have "Turbo Boost" [6], which is basically automatic overclocking when the processor core temperature and current draw are "low enough". That means that when you are putting the system under less than full load, you may get a clock boost on the cores that are actually working. That makes it harder to consistently tell what the real clock speed of a processor core is at any given moment, which complicates things like parallel thread scaling measurements.
- Another thing to note about the numbers for the Core i7-4770 in this table: they were generated with the MKL libraries in "update4" of the Intel Parallel Studio XE [7], which has some new tweaks for AVX2 that boosted the Core i7-4770@3.4GHz numbers up to those of the 4770K@3.5GHz.
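If you want to turn thread scaling measurements like these into speedup and parallel efficiency figures, a sketch along the following lines would do it. The GFLOPS values in `hypothetical_run` are made up purely for illustration and are NOT measurements from this article; the function name is also just an illustrative choice.

```python
def scaling_rows(gflops_by_threads):
    """Return (threads, gflops, speedup, efficiency) rows.

    Speedup is relative to the 1-thread result, and efficiency is the
    speedup divided by the thread count (1.0 means perfect scaling).
    """
    if 1 not in gflops_by_threads:
        raise ValueError("need a 1-thread measurement as the baseline")
    baseline = gflops_by_threads[1]
    rows = []
    for threads in sorted(gflops_by_threads):
        gflops = gflops_by_threads[threads]
        speedup = gflops / baseline
        rows.append((threads, gflops, speedup, speedup / threads))
    return rows


# HYPOTHETICAL numbers for a 4 core machine with Hyperthreading enabled.
# They are invented for this example, not taken from any benchmark run.
hypothetical_run = {1: 10.0, 2: 19.0, 4: 36.0, 8: 35.0}

for threads, gflops, speedup, efficiency in scaling_rows(hypothetical_run):
    print(f"{threads:2d} threads  {gflops:6.1f} GFLOPS  "
          f"speedup {speedup:4.2f}  efficiency {efficiency:4.2f}")
```

In this made-up data the efficiency drops sharply once the thread count exceeds the number of real cores, which is the pattern the Hyperthreading note describes. Keep in mind that Turbo Boost can inflate the 1-thread baseline, which makes the higher thread counts look worse than they really are.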
The new Haswell processors have some serious floating point capability with the new AVX2 instructions ... if you have code that can exploit it!
[1] The Linpack benchmark is basically solving a dense system of linear equations by reducing the matrix to triangular form using Gaussian elimination. It's a common type of problem you see in numerical computing and it scales well, which makes it a popular and meaningful benchmark (in my opinion). Wikipedia has a good article on it.
[3] AVX (Advanced Vector Extensions), and SSE (Streaming SIMD Extensions) before it, provide a nice SIMD (Single Instruction, Multiple Data) vector unit on x86 processors. You can get a significant speedup in programs that are optimized to take advantage of it. Wikipedia has a good page on AVX.
[5] GFLOPS: Giga FLOPS, i.e. billions of floating point operations per second.
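To make the Linpack footnote a little more concrete, here is a minimal pure Python sketch of Gaussian elimination with partial pivoting followed by back substitution. It only illustrates the kind of problem the benchmark solves; it is not the MKL implementation, which uses heavily optimized, blocked and threaded routines, and the function name is just an illustrative choice.

```python
def solve(a, b):
    """Solve a x = b for a small dense system (illustrative, not optimized)."""
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    # Forward elimination: reduce the matrix to upper triangular form.
    for col in range(n):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


a = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
print(solve(a, b))
```

The triple nested loop in the elimination step is where the roughly n³ floating point operations come from, and it is exactly the part that a tuned library spreads across vector units and threads.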