How To Use MKL with AMD Ryzen and Threadripper CPU's (Effectively) for Python Numpy (And Other Applications)

Introduction

In this post I'm going to show you a simple way to significantly speed up Python numpy compute performance on AMD CPU's when using Anaconda Python.

We will set a DEBUG environment variable for Intel MKL that forces it to use the AVX2 vector unit on AMD CPU's (this will work for other applications too, like MATLAB for example.) … but please see "BIG Caveat!" at the end of this post.

You may be wondering why this is an issue. In a recent post "AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy – MKL vs OpenBLAS" I showed how to use OpenBLAS instead and how bad performance was with AMD when using MKL. I also gave a bit of a history lesson explaining the long running "Optimization" issue between AMD and Intel. The short story is that Intel checks for "Genuine Intel" CPU's when its numerical library MKL starts executing code. If it finds an Intel CPU then it will follow an optimal code path for maximum performance on that hardware. If it finds an AMD processor it takes a code path that only optimizes to the old (ancient) SSE2 instruction level, i.e. it doesn't take advantage of the performance features on AMD and the performance will be several times slower than it "needs" to be.

The following paragraph is one of the most regretted things I've written …

Maybe you're thinking that it's not "fair" for Intel to do that, but … Intel has every right to do that! It IS their stuff. They worked hard and used a lot of resources to develop it. And, there IS some incompatibility at the highest (or lowest) levels of optimization for the hardware. MKL is insanely well optimized for Intel CPU's … as it should be!

I honestly don't feel that way! This came up years ago when Intel first started marketing their compilers. I was outraged then and really, I still am. I think I was trying to "justify" it because I couldn't change it then and I can't change it now. The only thing I can do now is show people how to get around it! For everyone that takes offense to that paragraph, I understand and I regret saying it. Collectively we should continue the fight against that sort of corporate tactic. –dbk

Read the post listed above if you are interested in this old and ongoing issue.

In the next sections we'll look at performance results from a simple numpy matrix algebra problem. There will be results from the post that was linked above along with new results using the 24-core AMD Threadripper 3960x.

Test systems: AMD Threadripper 3960x, Ryzen 3900X and Intel Xeon 2175W

AMD Hardware

AMD Threadripper 3960x 24-core AVX2
Motherboard Gigabyte TRX40 AORUS EXTREME
Memory 8x DDR4-2933 16GB (128GB total)
1TB Samsung 960 EVO NVMe M.2
NVIDIA RTX 2080Ti GPU

AMD Ryzen 3900X 12-core AVX2
Motherboard Gigabyte X570 AORUS ULTRA
Memory 4x DDR4-3200 16GB (64GB total)
2TB Intel 660p NVMe M.2
NVIDIA 2080Ti GPU

Intel Hardware

Intel Xeon-W 2175 14-core AVX512
ASUS C422 Pro SE (my personal workstation)
128GB DDR4 2400 MHz Reg ECC memory
Samsung 960 EVO 1TB NVMe M.2
NVIDIA Titan V GPU

Software

Ubuntu 18.04
Anaconda Python build Anaconda3-2019.07-Linux-x86_64
numpy 1.16.4
mkl 2019.4
libopenblas 0.3.6

Notes:

OpenBLAS is an excellent open source BLAS library based on the highly regarded work originally done by Kazushige Goto.
OpenBLAS does not currently have optimizations for AVX512. (It does include AVX2 optimizations.)

Using MKL_DEBUG_CPU_TYPE=5 with AMD CPU’s

The environment variable above is the "new secret way" to fool MKL into using an AVX2 optimization level on AMD CPU's. This environment variable has been available for years but it is not documented. PLEASE SEE THE CAVEAT IN THE CONCLUSION!

I seem to remember this from long ago with Opteron?? In any case it has been making the rounds on forums recently as a solution for getting MATLAB to perform better on AMD CPU's (other use cases too). This should work for any application that is making calls to the MKL runtime library. I believe it is forcing MKL to take the Haswell/Broadwell code path which gives an optimization level that includes AVX2. By default, MKL looks for "Genuine Intel" and if it doesn't find that it drops to a code path only optimized to SSE2 instruction level i.e. no modern hardware optimizations.
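Since the variable only makes sense on a non-Intel CPU that actually has AVX2, it can be worth checking the machine first. Here is a minimal sketch for Linux that reads the vendor string and feature flags from /proc/cpuinfo. The function names are my own (hypothetical), and the only facts it relies on are that AMD CPU's report the vendor id "AuthenticAMD" and that AVX2 shows up as "avx2" in the flags line.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, has_avx2) from the text of /proc/cpuinfo (Linux only)."""
    vendor = None
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between cores
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, "avx2" in flags


def mkl_workaround_applies(text):
    """True for an AMD CPU with AVX2, i.e. the case this post is about."""
    vendor, has_avx2 = parse_cpuinfo(text)
    return vendor == "AuthenticAMD" and has_avx2


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        info = f.read()
    print("vendor, avx2:", parse_cpuinfo(info))
    print("MKL_DEBUG_CPU_TYPE=5 worth trying:", mkl_workaround_applies(info))
```

On an Intel CPU this reports False, which is what you want since MKL already picks its optimized path there.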

On Linux you would set this environment variable in your working shell or add it to .bashrc,

export MKL_DEBUG_CPU_TYPE=5

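In a Jupyter notebook, "!export" will not do the job. A (!) command runs in its own subshell, so the variable is gone as soon as that shell exits and the Python kernel never sees it. Use the %env magic instead, in a cell that runs before numpy is imported,

%env MKL_DEBUG_CPU_TYPE=5

On Windows 10 you could set this in (Anaconda) PowerShell as,

$Env:MKL_DEBUG_CPU_TYPE=5

and then start Python or Jupyter from that same PowerShell session. Or, you could set it in a Jupyter notebook cell using %env the same as in Linux.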

You can also set this in System in Control Panel (Advanced tab or the Advanced System Settings item).
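One more option is to set the variable from inside Python itself. This is a minimal sketch, and it rests on an assumption I have not verified against Intel documentation (there isn't any for this variable): that MKL reads the value when the library first initializes. To be safe, the assignment goes before the first import of numpy in the process.

```python
import os

# Assumption: MKL looks at this variable when it first initializes,
# so it has to be in the environment before numpy (and MKL) gets loaded.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np  # noqa: E402  (imported late on purpose)

# Any BLAS-backed call from here on runs with the variable set.
a = np.ones((4, 4))
print("MKL_DEBUG_CPU_TYPE =", os.environ["MKL_DEBUG_CPU_TYPE"])
print("trace of a @ a =", np.trace(a @ a))
```

If numpy was already imported earlier in the same session (a long-running notebook kernel, for example), restart the kernel first, otherwise the setting may have no effect.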

Threadripper 3960x, Ryzen 3900X and Xeon 2175W performance using MKL, MKL_DEBUG_CPU_TYPE=5 and OpenBLAS for a Python numpy “norm of matrix product” calculation

numpy is the most commonly used numerical computing package in Python. The calculation presented in this testing is very simple but computationally intensive. It will take advantage of the BLAS library that gives numpy its great performance. In this case we will use Anaconda Python with "envs" set up for numpy linked with Intel MKL (the default) and with OpenBLAS (described in the next section).
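For readers who want to try something similar without going back to the older post, here is a minimal sketch of a "norm of matrix product" timing. It is not the exact script used for the results in this post. The function name, the matrix size and the seed are placeholders of mine, and the GFLOPS figure is only the usual 2n^3 estimate for the matrix multiply.

```python
import time

import numpy as np


def norm_of_matrix_product(n, seed=0):
    """Return (frobenius_norm, seconds) for the product of two n x n random matrices."""
    rng = np.random.RandomState(seed)
    A = rng.randn(n, n)
    B = rng.randn(n, n)
    start = time.perf_counter()
    nrm = np.linalg.norm(A @ B)  # the matrix multiply goes through the BLAS library
    elapsed = time.perf_counter() - start
    return nrm, elapsed


if __name__ == "__main__":
    n = 2000  # placeholder size; increase it for a longer, steadier run
    nrm, secs = norm_of_matrix_product(n)
    gflops = 2.0 * n**3 / secs / 1e9  # rough estimate, ignores the norm itself
    print("n = {}, norm = {:.4f}, {:.3f} s, ~{:.1f} GFLOPS".format(n, nrm, secs, gflops))
```

Run it once in a shell with MKL_DEBUG_CPU_TYPE=5 exported and once without to see whether the setting makes a difference on your hardware.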

[Chart: numpy Ryzen 3900X vs Xeon 2175W MKL vs OpenBLAS]

Look at those results and think about it for a while … The standout features are,

The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND, OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set.
MKL provides tremendous performance optimization on Intel CPU's. The test job is definitely benefiting from AVX512 optimizations which are not available in this OpenBLAS version.
OpenBLAS levels the performance difference considerably by providing good optimization up to the level of AVX2. (Keep in mind that the 2175W has 14 cores vs 12 cores on the Ryzen 3900X and 24 cores on the TR 3960x.)
The low optimization code path used for AMD CPU's by MKL is devastating to performance.

This test clearly shows the effect of hardware specific code optimization. It is also pretty synthetic! In the real world programs are more complicated and are usually not anywhere near fully optimized, especially with regard to vectorization that takes advantage of AVX. There are also common numerical libraries that are not so heavily targeted to specific architectures, for example the popular, and very good, C++ boost library suite.

Creating an “env” with conda that includes OpenBLAS for numpy

Please see this older post "AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy – MKL vs OpenBLAS" for information on how to use OpenBLAS with Anaconda Python and the Python code that was used for this testing.
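Whichever env you end up in, it is easy to lose track of which BLAS numpy is actually linked against. A minimal sketch for checking from Python follows. np.show_config() is a real numpy function that prints the build configuration; the helper name and the simple substring search are my own and only a rough indicator, since the layout of that output differs between numpy versions.

```python
import contextlib
import io

import numpy as np


def blas_config_text():
    """Capture what np.show_config() prints and return it as one string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


if __name__ == "__main__":
    text = blas_config_text().lower()
    for name in ("mkl", "openblas", "blis"):
        print("{:9s} mentioned in numpy build config: {}".format(name, name in text))
```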

Conclusion (BIG Caveat!)

I have to reiterate, MKL_DEBUG_CPU_TYPE is an undocumented environment variable. That means that Intel can remove it at any time without warning. And, they have every right to do that! It is obviously intended for internal debugging, not for running with better performance on AMD hardware. It is also possible that the resulting code path has some precision loss or other problems on AMD hardware. I have not tested for that!
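If you want at least a basic sanity check of the results, here is a minimal sketch of two tests. It is nowhere near a full validation of MKL on AMD hardware, and the function names, sizes and seeds are mine. The first test multiplies small integer-valued matrices, where every partial sum is exactly representable in float64, so a correct BLAS has to match numpy's integer matmul (which does not go through BLAS) exactly. The second compares against a long double reference, which is extended precision on x86 Linux but the same as float64 on some platforms such as Windows.

```python
import numpy as np


def exact_integer_check(n=200, seed=0):
    """True if the BLAS float64 product of small integer matrices is exact."""
    rng = np.random.RandomState(seed)
    A = rng.randint(-8, 9, size=(n, n))
    B = rng.randint(-8, 9, size=(n, n))
    expected = A @ B  # integer matmul, not routed through BLAS
    got = A.astype(np.float64) @ B.astype(np.float64)  # BLAS dgemm
    return bool(np.array_equal(got, expected))


def max_relative_error(n=200, seed=0):
    """Largest elementwise error of the BLAS product, relative to the largest entry."""
    rng = np.random.RandomState(seed)
    A = rng.randn(n, n)
    B = rng.randn(n, n)
    got = A @ B
    # long double matmul also bypasses BLAS; it is only a higher-precision
    # reference where long double is wider than float64.
    ref = A.astype(np.longdouble) @ B.astype(np.longdouble)
    return float(np.max(np.abs(got - ref)) / np.max(np.abs(ref)))


if __name__ == "__main__":
    print("integer product exact:", exact_integer_check())
    print("max relative error   :", max_relative_error())
```

A relative error far above float64 rounding level (roughly 1e-16 times a modest factor for the matrix size) or a failed exact check would be a reason to stop using the setting.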

The best solution for running numerically intensive code on AMD CPU's is to try working with AMD's BLIS library if you can. Version 2.0 of BLIS gave very good performance in my recent testing on the new 3rd gen Threadripper. For the numpy testing above it would be great to be able to use the BLIS v2.0 library with Anaconda Python the same way that I used OpenBLAS. Someone just needs to set up the conda package with the proper hooks to set it as the default BLAS. I don't have the time or expertise to do this myself, so if you can do it then please do, and let me know about it!

Happy computing! –dbk @dbkinghorn