Haswell Floating Point Performance

I was doing some system testing on a nicely configured dual Intel Xeon node using the optimized Linpack benchmark¹ from the Intel MKL² (Math Kernel Library) and was really impressed with the performance numbers I was seeing. I had a couple of new Haswell systems set up, so I decided to run the same benchmark code on those too. I was stunned by the performance of the Haswell processors! I tested a few other systems and found that Haswell gives about a 70% speedup over Ivy Bridge at the same core clock speed. That surprised me, because what I had been hearing about Haswell was that it was generally only about 10% (or less) faster than Ivy Bridge. It looks like the new AVX2³ instructions on Haswell are really effective.

On Intel hardware the numerical routines in the MKL are highly optimized to take full advantage of every bit of performance you can get out of the hardware, and most of the algorithms are SMP parallel using OpenMP⁴ threads. Running the Linpack benchmark included with MKL will give you a good idea of what your system is really capable of as far as numerical computing goes. It also gives you good motivation for making use of those library routines in your own code when you can, since a lot of the heavy performance optimization is already done for you.
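To give a feel for what the benchmark is actually timing, here is a minimal pure-Python sketch of solving a small dense linear system by Gaussian elimination with partial pivoting. It is only an illustration of the kind of computation involved; the function name `solve` and the 3x3 example are my own, and this is not how MKL implements it (MKL uses blocked, vectorized, multi-threaded routines).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    a is a list of n rows (lists of floats), b is a list of n floats.
    The inputs are copied, not modified.
    """
    n = len(b)
    # Work on the augmented matrix [a | b].
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k (rows k..n-1) to row k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate the entries below the diagonal in column k.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]]
    b = [8.0, -11.0, -3.0]
    print(solve(a, b))  # close to [2, 3, -1], up to rounding
```

The triple loop is where nearly all the floating point work happens, and it is exactly the kind of loop that a tuned library turns into vector instructions spread across all cores.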

Table 1 — Linpack benchmark using the Intel MKL library

System	Processor	Real cores @ clock, vector instructions	Linpack (GFLOPS)
1	Dual Xeon E5 2687	16 cores @ 3.2GHz AVX	345 GFLOPS
2	Dual Xeon E5 2650	16 cores @ 2.0GHz AVX	262 GFLOPS
3	Core i7-4770K (Haswell)	4 cores @ 3.5GHz AVX2	177 GFLOPS
4	Xeon E3 1245v3 (Haswell)	4 cores @ 3.4GHz AVX2	170 GFLOPS
5	Core i5 3570 (Ivy Bridge)	4 cores @ 3.4GHz AVX	105 GFLOPS
6	Core i7 920	4 cores @ 2.66GHz SSE4.2	40 GFLOPS

If you are the type of person who knows what GFLOPS⁵ means, and cares about it, then look at those numbers again and realize what you are seeing. Your heart rate should be a bit higher now. OK, OK, I know there is a big difference between a dual E5 Xeon compute node and a desktop Core i7 machine but, hey, you can run some serious compute jobs on a simple Core i7 desktop system with 32GB of RAM and a nifty SSD!

If you suffer from "number blindness" and prefer a chart, here you go:

[Chart: Linpack performance]

Notes for Table 1:

The number of cores refers to the physical cores in the processor. "Hyperthreading" was enabled on those systems during the benchmark runs, so the systems would report twice as many "cores" as are listed here. Table 2 shows that for heavy numerical computation Hyperthreading is less than useless. Also note that the Core i5 processor doesn't have Hyperthreading.
Systems 1, 2, 3 and 4 were running CentOS 6.4 Linux. System 5 was running Ubuntu 13.04 and system 6 was running Fedora 18. The underlying Linux system isn't going to make any significant difference for this benchmark.
AVX2 is probably the real star of the show here. AVX gives each core a 256-bit vector processing unit, i.e. a single instruction operates on several floating point values at once (four doubles per register). Haswell has the latest iteration of this technology, AVX2, which arrived together with fused multiply-add (FMA) instructions that do a multiply and an add in one operation. In general AVX is the newer replacement for what SSE instructions were used for in the past.
The problem size used for most of the results had a leading dimension of 25000 for the matrix A. This used approximately 5GB of memory. I could have used a larger problem size, but I just happened to be using that size for testing some Xeon Phi cards too 🙂 and it's a large enough problem to give a good representation of the system's performance; it takes about 1 minute on a Core i7 4770. The problem size for the job run on system 1 was actually 45000. I used that in the table since it was the only number I recorded from the testing runs I did on that machine.
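The memory and run time quoted for the 25000 problem size can be sanity-checked with a little arithmetic. The sketch below assumes 8-byte double precision elements and the usual nominal Linpack operation count of 2/3·n³ + 2·n²; the function names are just for illustration.

```python
def matrix_memory_gb(n, bytes_per_element=8):
    """Memory for a dense n x n matrix, in GB (10^9 bytes)."""
    return n * n * bytes_per_element / 1e9


def linpack_flops(n):
    """Nominal floating point operation count for solving an n x n system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def estimated_seconds(n, gflops):
    """Run time implied by a sustained rate given in GFLOPS."""
    return linpack_flops(n) / (gflops * 1e9)


if __name__ == "__main__":
    n = 25000
    print(f"matrix memory: {matrix_memory_gb(n):.1f} GB")
    print(f"time at 177 GFLOPS: {estimated_seconds(n, 177):.0f} s")
```

For n = 25000 this gives 5.0 GB for the matrix and a bit under a minute at 177 GFLOPS, which lines up with the figures above.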

The really impressive feature of these performance numbers is the implied performance per core. Just dividing the GFLOPS by the number of cores in a system is NOT a good way to judge per core performance, but it does convey the pulse quickening information that I referred to earlier. For the actual thread scaling performance see Table 2 below.

[Chart: GFLOPS per core]
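If you want the per core numbers themselves, they are just the Table 1 results divided by the core counts. Here is a small sketch that does the division and also compares the Haswell Xeon E3 with the Ivy Bridge Core i5, which run at the same 3.4GHz clock. The data is copied from Table 1; the names `TABLE_1`, `gflops_per_core` and `speedup` are made up for this example.

```python
# (processor, real cores, nominal clock in GHz, measured GFLOPS), from Table 1
TABLE_1 = [
    ("Dual Xeon E5 2687", 16, 3.2, 345),
    ("Dual Xeon E5 2650", 16, 2.0, 262),
    ("Core i7-4770K (Haswell)", 4, 3.5, 177),
    ("Xeon E3 1245v3 (Haswell)", 4, 3.4, 170),
    ("Core i5 3570 (Ivy Bridge)", 4, 3.4, 105),
    ("Core i7 920", 4, 2.66, 40),
]


def gflops_per_core(gflops, cores):
    """Naive per core figure: total GFLOPS divided by real cores."""
    return gflops / cores


def speedup(new_gflops, old_gflops):
    """Ratio of two measured results."""
    return new_gflops / old_gflops


if __name__ == "__main__":
    for name, cores, ghz, gflops in TABLE_1:
        print(f"{name:28s} {gflops_per_core(gflops, cores):6.2f} GFLOPS/core")
    # Same 3.4GHz clock: Xeon E3 1245v3 (Haswell) vs Core i5 3570 (Ivy Bridge)
    print(f"Haswell vs Ivy Bridge at 3.4GHz: {speedup(170, 105):.2f}x")
```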

I was so impressed with the performance of the Core i7 4770 that I bought one for myself! I built a new home system to replace my aging (vintage 2008) Core i7 920 system (system number 6 in Table 1).

To round out these numbers, and confuse you more (these are benchmarks after all!), I've included another table showing the parallel scaling on the Xeon E5 2650 system and on my new system with a "non-K" Core i7 4770 (I wanted VT-d capability, which the "K" "overclocker" processors don't have).

Table 2 — OpenMP thread scaling

Threads	Dual Xeon E5-2650 16 real cores (GFLOPS)	Core i7-4770 4 real cores (GFLOPS)
32	262
16	263
8	137	176
4	74	177
2	40	96
1	21	50

Notes for Table 2:

You can see what I mean by "less than useless" when referring to Hyperthreading for compute problems like Linpack: going from 16 to 32 threads on the Xeon system, or from 4 to 8 threads on the Core i7, gains nothing. When Hyperthreading was first introduced it caused significant slowdown for numerical computation and it was common practice to disable it in the BIOS. With the modern implementation it doesn't seem to cause much trouble, so I leave it enabled just in case there actually is something that benefits from it.
Something you should keep in mind about modern Intel processors is that they have "Turbo Boost"⁶, which is basically automatic overclocking when the processor core temperature and current draw are "low enough". That means that when you are putting the system under less than full load, you may get a clock boost on the cores that are actually working. That makes it harder to tell what the real clock speed of a processor core is at any given moment, which complicates things like parallel thread scaling measurements.
One more thing to note about the numbers for the Core i7-4770 in this table: they were generated with the MKL libraries in "update4" of the Intel Parallel Studio XE⁷, which has some new tweaks for AVX2 that boosted the Core i7-4770 numbers up to those of the Core i7-4770K.
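The scaling in Table 2 is easier to judge as speedup and parallel efficiency relative to the single thread run. The sketch below computes both from the table values. Dividing by min(threads, real cores) is my own choice, so that the Hyperthreading rows are measured against the physical cores; keep in mind that Turbo Boost probably inflates the 1-thread baseline, which makes the efficiencies look a little worse than they really are.

```python
# Measured GFLOPS by thread count, copied from Table 2
XEON_E5_2650 = {1: 21, 2: 40, 4: 74, 8: 137, 16: 263, 32: 262}
CORE_I7_4770 = {1: 50, 2: 96, 4: 177, 8: 176}


def scaling(results, real_cores):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread run.

    Efficiency is speedup divided by min(threads, real_cores), so threads
    beyond the physical core count are not expected to add anything.
    """
    base = results[1]
    rows = []
    for threads in sorted(results):
        speedup = results[threads] / base
        efficiency = speedup / min(threads, real_cores)
        rows.append((threads, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for label, results, cores in (
        ("Dual Xeon E5-2650", XEON_E5_2650, 16),
        ("Core i7-4770", CORE_I7_4770, 4),
    ):
        print(label)
        for threads, speedup, efficiency in scaling(results, cores):
            print(f"  {threads:2d} threads: {speedup:5.2f}x  {efficiency:6.1%}")
```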

Bottom line:

The new Haswell processors have some serious floating point capability with the new AVX2 instructions … if you have code that can exploit it!
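One way to put the measured numbers in context is to compare them with a rough theoretical peak: cores × clock × double precision FLOPs per cycle per core. The FLOPs-per-cycle values below are the commonly quoted figures for each processor generation (4 for SSE, 8 for AVX, 16 for Haswell with AVX2 and FMA); they are my assumption, not something reported by the benchmark, and the clock is the nominal one from Table 1, so Turbo Boost can push a result above the "peak".

```python
# Assumed peak double precision FLOPs per cycle per core, keyed by the
# instruction set label used in Table 1 (commonly quoted figures, not measured):
#   SSE4.2: 128-bit add + multiply per cycle          -> 4
#   AVX:    256-bit add + multiply per cycle          -> 8
#   AVX2:   Haswell, two 256-bit FMA units per cycle  -> 16
FLOPS_PER_CYCLE = {"SSE4.2": 4, "AVX": 8, "AVX2": 16}


def peak_gflops(cores, ghz, isa):
    """Rough theoretical peak at the nominal clock, in GFLOPS."""
    return cores * ghz * FLOPS_PER_CYCLE[isa]


def efficiency(measured_gflops, cores, ghz, isa):
    """Measured Linpack result as a fraction of the rough peak."""
    return measured_gflops / peak_gflops(cores, ghz, isa)


if __name__ == "__main__":
    systems = [  # (processor, cores, GHz, instruction set, measured GFLOPS) from Table 1
        ("Core i7-4770K (Haswell)", 4, 3.5, "AVX2", 177),
        ("Core i5 3570 (Ivy Bridge)", 4, 3.4, "AVX", 105),
        ("Core i7 920", 4, 2.66, "SSE4.2", 40),
    ]
    for name, cores, ghz, isa, measured in systems:
        peak = peak_gflops(cores, ghz, isa)
        frac = efficiency(measured, cores, ghz, isa)
        print(f"{name:26s} peak {peak:6.1f}  measured {measured:4d}  ({frac:.0%})")
```

Under these assumptions the Haswell part has roughly twice the peak of Ivy Bridge per core per clock, which is in the same ballpark as the jump seen in Table 1.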

Footnotes:

¹ The Linpack benchmark basically solves a dense system of linear equations by reducing the matrix to triangular form using Gaussian elimination (LU factorization with partial pivoting). It's a common type of problem in numerical computing and it scales well, making it a popular and meaningful benchmark (in my opinion). Wikipedia has a good article on it.

² The Intel® Math Kernel Library is a collection of numerical computing libraries providing highly optimized routines for Linear Algebra, Fast Fourier Transforms, Statistics etc.

³ AVX (Advanced Vector Extensions), and SSE (Streaming SIMD Extensions) before it, provide a nice SIMD (Single Instruction, Multiple Data) vector unit on x86 processors. You can get a significant speedup in programs that are optimized to take advantage of it. Wikipedia has a good page on AVX.

⁴ OpenMP provides a multi-threading API for multi-core SMP parallel environments.

⁵ GFLOPS: Giga FLOPS, i.e. billions of Floating point Operations Per Second.

⁶ Intel® Turbo Boost Technology. I think of it as automatic, load dependent overclocking.

⁷ Intel® Parallel Studio XE 2013
