
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1560
Dr Donald Kinghorn (Scientific Computing Advisor )

AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy - MKL vs OpenBLAS

Written on August 20, 2019 by Dr Donald Kinghorn

Introduction (and a bit of history!)

In this post I've done more testing with Ryzen 3900X looking at the effect of BLAS libraries on a simple but computationally demanding problem with Python numpy. The results may surprise you! I start with a little bit of history of Intel vs AMD performance to give you what may be a new perspective on the issue.

More reality check with the AMD Zen2 Ryzen 3900X! There are two necessary and equally important ingredients needed to do anything useful with a computer: the hardware and the software.

The hardware and software, together with helpful tools like specialized performance and feature libraries and community support, make up what is referred to as an "ecosystem" when they are made available to developers and users in an organized, cohesive, professional, and easily usable manner.

The best example (ever!), of a well done hardware/software ecosystem is what NVIDIA has done with GPU accelerated computing.

On the CPU side of computing, Intel has a very good and, lately, rapidly expanding ecosystem for the x86 CPU architecture. Intel learned some lessons from the glory days of the proprietary UNIX hardware vendors. The UNIX hardware vendors all had their own proprietary compilers and numerical libraries, i.e. BLAS libraries. When the UNIX hardware/software goliath DEC (Digital Equipment Corporation) shut down, some of their talented compiler developers went to Intel. (Important VMS OS devs went to Microsoft and created Windows NT.) Intel developed a "best in class" compiler suite and the highly optimized compute library collection MKL (Math Kernel Library).

What about AMD? AMD is the only other major company with a license to the x86 architecture. Indeed, AMD is responsible for many advancements to the architecture, like the 64-bit x86_64 extension, which was originally known as AMD64! (amd64 is still used as the name extension for 64-bit x86 Linux binary packages.) So, do the Intel compilers and MKL work with AMD CPUs? Yes, kind of...

When Intel first released their (fantastic) compilers and MKL, it was soon discovered that they worked on AMD processors but didn't give very good performance (even though the core architecture was the same). Someone got the idea to spoof the processor ID (hard to do) on an AMD Opteron so that it responded as "Genuine Intel", and sure enough, the performance went way up! When a program's calls to MKL start up, the first thing that happens is a check for "Genuine Intel" and for the available compute-core features. Then the calls execute a "code-path" with the best optimizations for those core features. But if the ID is AMD, the code-path chosen is an old SSE optimization path, i.e. no modern performance optimizations. Intel has every right to do that! It IS their stuff, and there IS some incompatibility at the highest (or lowest) levels of optimization for the hardware. And MKL is insanely well optimized for Intel CPUs ... as it should be!
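
The vendor-gated dispatch described above can be illustrated with a minimal sketch. This is not MKL's actual code: `parse_cpuinfo` and `choose_code_path` are hypothetical helpers, and the code-path names are made up for illustration. It assumes Linux-style `/proc/cpuinfo` text, where the vendor string is `GenuineIntel` on Intel CPUs and `AuthenticAMD` on AMD CPUs.

```python
def parse_cpuinfo(text):
    """Return (vendor_id, set_of_flags) from /proc/cpuinfo-style text."""
    vendor, flags = None, set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between logical CPUs
        key = key.strip()
        if key == "vendor_id" and vendor is None:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


def choose_code_path(vendor, flags):
    """Hypothetical dispatch: the vendor ID is checked before the features."""
    if vendor != "GenuineIntel":
        # Non-Intel ID: fall back to the old SSE path, whatever the CPU supports
        return "sse-baseline"
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "sse-baseline"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            vendor, flags = parse_cpuinfo(f.read())
        print(vendor, "->", choose_code_path(vendor, flags))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

In this sketch an AMD CPU that reports `avx2` still lands on the baseline path, because the vendor test comes first. That ordering is the behavior the paragraph above describes.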

Long story short, there was a lawsuit and a settlement. Nothing much changed except that Intel has to include an "Optimization notice" on their publications.

The bottom line is that Intel is a massive company compared to AMD, and their resources for developing an "ecosystem" are much greater. AMD is doing good work on their ecosystem. There is an optimizing compiler project, AOCC, and AMD's BLIS (BLAS) performance library.

To get the best out of AMD hardware you often have to do a little extra work. It's good to be aware of that! The new AMD hardware is looking really good and they have gotten a very important contract for what will be the US's first ExaScale Supercomputer. That DOE contract will provide a large amount of funding and highly skilled developers to expand and optimize the "ecosystem" for AMD. It will definitely be getting better ...soon!

Test systems: AMD Ryzen 3900X and Intel Xeon 2175W

AMD Hardware

  • AMD Ryzen 3900X 12-core AVX2
  • Motherboard Gigabyte X570 AORUS ULTRA
  • Memory 4x DDR4-3200 16GB (64GB total)
  • 2TB Intel 660p NVMe M.2
  • NVIDIA 2080Ti GPU

Intel Hardware

  • Intel Xeon-W 2175 14-core AVX512
  • ASUS C422 Pro SE (My personal workstation )
  • 128GB DDR4 2400 MHz Reg ECC memory
  • Samsung 960 EVO 1TB NVMe M.2
  • NVIDIA Titan V GPU

Software

  • Ubuntu 18.04
  • Anaconda Python build Anaconda3-2019.07-Linux-x86_64
  • numpy 1.16.4 (default env)
  • mkl 2019.4 (default env)
  • libopenblas 0.3.6 (in my "openblas-np" env)

I will describe how to create an env with numpy linked to OpenBLAS in the section after the results.

Notes:

  • OpenBLAS is an excellent open source BLAS library based on the highly regarded work originally done by Kazushige Goto.
  • OpenBLAS does not currently have optimizations for AVX512. (It does include AVX2 optimizations.)

Now onto some simple testing that will illustrate the consequences of the history discussed in the introduction.

Ryzen 3900X and Xeon 2175W performance using MKL and OpenBLAS for a Python numpy "norm of matrix product" calculation

numpy is the most commonly used numerical computing package in Python. The calculation presented in this testing is very simple but computationally intensive. It will take advantage of the BLAS library that gives numpy its great performance. In this case we will use Anaconda Python with "envs" set up for numpy linked with Intel MKL (the default) and with OpenBLAS (described in the next section).


[Chart: numpy Ryzen 3900X vs Xeon 2175W, MKL vs OpenBLAS]

Those are pretty dramatic differences! The standout features are,

  • MKL provides tremendous performance optimization on Intel CPUs. The test job is definitely benefiting from AVX512 optimizations, which are not available in this OpenBLAS version.
  • OpenBLAS levels the performance difference considerably by providing good optimization up to the level of AVX2. (Keep in mind that the 2175W has 14 cores vs 12 cores on the Ryzen 3900X.)
  • The low optimization code-path used for AMD CPUs by MKL is devastating to performance.

This test clearly shows the effect of hardware-specific code optimization. It is also pretty synthetic! In the real world, programs are more complicated and are usually nowhere near fully optimized, especially with regard to vectorization that takes advantage of AVX. There are also common numerical libraries that are not so heavily targeted to specific architectures, for example the popular, and very good, C++ Boost library suite.

Note: I also tried to set up a numpy linked with the AMD BLIS library, but it did not work correctly (very poor performance). I did not troubleshoot the issues.

Creating an "env" with conda that includes OpenBLAS for numpy

What I did to get numpy with different BLAS lib links in Anaconda python was simple.

Create and activate an env for the OpenBLAS linked numpy,

conda create --name openblas-np
conda activate openblas-np

Then install numpy specifying the BLAS library,

conda install numpy jupyter ipykernel "blas=*=openblas"

Then I created a kernel for Jupyter notebook and started a notebook using that kernel,

python -m ipykernel install --user --name openblas-np
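
Before timing anything in a notebook, it can be worth confirming that the kernel is really running the interpreter from the new env. The following is a minimal sketch: `conda_env_name` is a hypothetical helper, and it assumes the usual conda layout where named envs live under an `envs` directory.

```python
import os
import sys


def conda_env_name(prefix):
    """Guess the conda env name from an interpreter prefix path.

    Returns None when the prefix is not inside an 'envs' directory,
    which is the case for the base install or a non-conda Python.
    """
    parent, name = os.path.split(os.path.normpath(prefix))
    if os.path.basename(parent) == "envs":
        return name
    return None


# Run this in a notebook cell; sys.prefix belongs to the running kernel
print("interpreter:", sys.executable)
print("env guess:  ", conda_env_name(sys.prefix))
```

If the guess is not `openblas-np`, the notebook is probably using a different kernel than the one created above.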

The following are the Jupyter notebook input cells for the test,


import numpy as np
import time
n = 20000
A = np.random.randn(n,n).astype('float64')
B = np.random.randn(n,n).astype('float64')
start_time = time.time()
nrm = np.linalg.norm(A@B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ",nrm)
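
To put a run time in context, it helps to convert it to an approximate floating point rate and to check the memory the job needs. The sketch below uses the standard estimate of about 2n^3 floating point operations for a dense n x n matrix product (the norm adds a comparatively negligible amount). `matmul_gflops` and `matrix_gb` are hypothetical helper names, and the 16 second figure is only an example input.

```python
def matmul_gflops(n, seconds):
    """Approximate GFLOP/s for an n x n by n x n matrix product.

    Uses the usual 2*n**3 operation count (n**3 multiplies plus
    roughly n**3 adds); the norm calculation is ignored.
    """
    return 2.0 * n**3 / seconds / 1e9


def matrix_gb(n, bytes_per_element=8):
    """Memory for one n x n matrix in GB (float64 is 8 bytes per element)."""
    return n * n * bytes_per_element / 1e9


n = 20000
print("one matrix: {:.1f} GB".format(matrix_gb(n)))
print("A, B and A@B together: {:.1f} GB".format(3 * matrix_gb(n)))
print("rate at 16 s: {:.0f} GFLOP/s".format(matmul_gflops(n, 16.0)))
```

With n = 20000, the two inputs and the product need close to 10 GB between them, so a system with 16GB of memory has little headroom for this test.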

Here's a command and the output that shows that the numpy configuration is indeed using OpenBLAS,

np.__config__.show()

    blas_mkl_info:
      NOT AVAILABLE
    blis_info:
      NOT AVAILABLE
    openblas_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_mkl_info:
      NOT AVAILABLE
    openblas_lapack_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
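
The configuration output reports what numpy was built against. As a complementary check, one can look at which BLAS shared libraries the running process has actually loaded. The following is a rough, Linux-only sketch: `blas_libs_in_maps` is a hypothetical helper that scans `/proc/self/maps` text, and the name matching is a simple heuristic, not an exhaustive list of BLAS library names.

```python
def blas_libs_in_maps(maps_text):
    """Return sorted BLAS-like shared library file names found in
    /proc/<pid>/maps-style text (heuristic name matching only)."""
    tags = ("libmkl", "openblas", "libblis")
    found = set()
    for line in maps_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The mapped file path, when present, is the last field
        name = fields[-1].rsplit("/", 1)[-1]
        lower = name.lower()
        if ".so" in lower and any(tag in lower for tag in tags):
            found.add(name)
    return sorted(found)


if __name__ == "__main__":
    try:
        import numpy as np
        np.ones((2, 2)) @ np.ones((2, 2))  # touch BLAS so the library is loaded
        with open("/proc/self/maps") as f:
            print(blas_libs_in_maps(f.read()))
    except (ImportError, OSError) as exc:
        print("could not inspect this process:", exc)
```

In an env like the one above, the list would be expected to contain an OpenBLAS library and no MKL libraries.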

Conclusions

When I started this testing I was expecting to see results like those I found; however, I was still surprised when I saw them! I hope this post has given you a new perspective on the AMD vs Intel thing. I have great respect for both of these companies and I know we are going to see good things from both of them going forward.

The AMD Ryzen Zen2 processors are impressive and seem to be an excellent value. We are working diligently on validating full platforms. We are not rushing this process. There are still rough edges on the total system package that we want to get right. I am looking forward to getting my hands on the next Threadripper and hopefully will get a chance to fire up an Epyc Rome system too.

Happy computing! --dbk @dbkinghorn



Tags: Ryzen, Python, Scientific Computing, AMD, numpy, BLAS
Misha Engel

This is how business works.
Intel Xeon W-2175 16.1 second vs. AMD Ryzen 9 3900x 39.9 seconds.

The Intel is about 2.5x as fast (148% faster) at about 3.9x the price (290% higher).
Intel $1,947 vs. AMD $499

With other software like Blender, DaVinci Resolve, and Cinema4D the 3900X will be around 30% faster.
AMD spends money on better (open source) compilers, Intel spends money to block AMD, nothing new.

Posted on 2019-08-21 01:14:31
lemans24

Really depends on what you are using the hardware for as it is only a one time cost. If you are dependent on calculations that drive your income then the 3175x could pay for itself very quickly...

Posted on 2019-08-26 17:45:22
lemans24

sorry meant 2175W

Posted on 2019-08-26 17:46:45
Misha Engel

Or update OpenBLAS, it will give you free performance.

Posted on 2019-08-27 14:36:11
lemans24

I definitely think you should optimize as much in software as possible and then buy the fastest hardware that you can afford with the best rate of return!! No such thing as free performance!!! Took me over 6 months of bare metal c/c++ programming in CUDA to get 100x increase in performance which if I was paid to do would have been into 6 figures!!! Now that I have optimal software performance, buying the fastest NVidia gpu cards is relatively cheap...

Posted on 2019-08-27 15:37:06
Jan Dorniak

Phoronix has a nice test of how recent compilers work with AMD's Zen 2. If you recompile OpenBLAS with AOCC it might well turn out to gain a fair bit of performance. Even clang or gcc with -march=native should help, although I'm not sure if the versions with zenver2 optimisation tables are already released.

In case you didn't know: a recent BIOS update for AMD chipsets fixes the systemd issue for Ubuntu 19.04 and other recent distributions.

Link to the AOCC test (if you allow links): https://www.phoronix.com/sc...

Posted on 2019-08-21 07:00:39
Kyle Vrooman

Thanks for this note. Always interested to see the extra steps you need to enable optimized performance.

Considering that the standard packages in Anaconda for Tensorflow-cpu now also build for MKL, I would assume there would be a large regression for AMD processors when the standard package changed over ? Compiling for march=native and non-mkl would seem important following from your investigation above...?

Posted on 2019-08-21 13:14:59
MagicWax

Please update your OpenBLAS version if possible. 0.3.7 is now out, and it has new optimizations for Zen2

Posted on 2019-08-21 18:28:57
Misha Engel

Since when is that in intel's interest?

Posted on 2019-08-21 20:49:36
MagicWax

Your point being...?

Posted on 2019-08-22 14:20:09
Donald Kinghorn

The BLAS lib performance of a CPU is important but it is highly optimized for a given arch. In the real world it's usually not impactful on complicated programs. The new Zen2 cores look great! ...overall. I wouldn't hesitate to recommend the new Ryzen or Epyc (or soon TR) :-)

We are very picky (conservative) at Puget Systems. We won't move to a new platform unless we know it's stable and performant. We do a lot of testing and we report what we get! Intel has been very solid for a long time (and still is!). The new AMD CPUs are doing quite well in our testing and the platform has stabilized enough that we are now moving some of our recommended systems to Ryzen. AMD is back in the game and we all think it's great!

Posted on 2019-11-14 17:06:43
MagicWax

As an aside, it is not too hard to take the MKL library and patch it with a hex editor, so that it runs the Intel path on AMD. There are also some undocumented build tricks that one can use to knock out the CPU vendor checks from both the Intel compilers and MKL. Agner Fog has a great guide on how to do it, and unlike binary patching, you can distribute the resulting binary!

Posted on 2019-08-21 18:38:36
Donald Kinghorn

Hey Magic, Thanks for the mention of Agner Fog! I couldn't remember his name, he did some nice stuff ... I'll look him up again...

Posted on 2019-11-14 16:50:11
nandodmelo

I previously followed the tutorial about TensorFlow GPU (https://www.pugetsystems.co...).

If I install OpenBLAS in the same env, will it cause any trouble?

Posted on 2019-10-02 13:38:14
Lance McCormick

okay but why the two different graphics cards..

Posted on 2019-10-08 03:30:23
Donald Kinghorn

The Xeon-W 14 core with the Titan V sys is my personal box. The Ryzen with the 2080Ti was at the office ... just happened to have the 2080Ti in there :-)

Posted on 2019-11-14 16:48:18
Donald Kinghorn

Sorry everyone I was not receiving notification of comments on this post!

Posted on 2019-11-14 16:47:30
KM Video

Can you try this and include this test?
https://www.reddit.com/r/ma...

Posted on 2019-11-19 19:17:19

Just FYI, Dr. Kinghorn is away at Supercomputing this week - so it may be a few days before he is able to reply to comments :)

Posted on 2019-11-19 19:24:20
Donald Kinghorn

That is a great idea! I wasn't aware of using MKL DEBUG environment in that way. I will definitely try it. That will hopefully allow for a better comparison between different BLAS libraries ... I was actually going to try a more drastic hack than that to force optimizations but if this works it would be much better.

I should have access to a good system for testing when I get back from SC19 and will try to get a post up about it quickly. Thanks! --Don

Posted on 2019-11-20 04:37:29
Donald Kinghorn

I posted both places ...
OK, that's interesting! I'm not sure what is going on there. I don't have access to an AMD test system right now so I can't check it out.
I did re-run this in a fresh env with MKL 2020.1 on my Xeon-W 2175 sys. Here's what I got, Last result is pretty interesting!

(numpy-mkl) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ python matnorm.py
took 16.17172908782959 seconds

(numpy-openblas) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ python matnorm.py
took 38.571579933166504 seconds

And this is the interesting part;

(numpy-mkl) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ export MKL_DEBUG_CPU_TYPE=5
(numpy-mkl) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ python matnorm.py
took 16.00005531311035 seconds

Enabling the debug flag did not change the MKL result (much)! I would have expected a performance drop of 25-35%, assuming that the flag is setting MKL to a Haswell AVX2 CPU type.

This is even more interesting;

(numpy-mkl) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ export MKL_ENABLE_INSTRUCTIONS=AVX2
(numpy-mkl) kinghorn@i9:~/projects/TR32-testing/numpy-matnorm$ python matnorm.py
took 26.644426584243774 seconds

MKL_ENABLE_INSTRUCTIONS=AVX2 is giving me a result that I expected from MKL_DEBUG_CPU_TYPE=5 ????

You might want to try setting MKL_ENABLE_INSTRUCTIONS=AVX2 on your Ryzen sys and see if it makes any difference ... let me know! --Don

Posted on 2020-06-22 21:00:06
Shihab Shahriar

I ran the above code on a Ryzen 3700X with 16GB of RAM.

The openblas version was ~59 seconds, while MKL version was ~56 seconds.

The speed of openblas version makes sense, since I'm using a weaker CPU. But how is my MKL version so fast compared to above result?

Numpy: 1.18.1
openblas: 0.3.6
MKL:2020.1

Maybe Intel/MKL has stopped deliberately punishing AMD cpus?

Posted on 2020-06-19 20:13:51
babiloe

It makes me wonder whether high-end simulation software that depends heavily on Intel MKL, like Abaqus or Nastran, will see less improvement on AMD Zen2 or Zen3 than on Intel. I wish Puget had tested it. The SolidWorks stress testing already shows Intel faster, even though AMD Zen2 ran better in other tests.

Posted on 2020-11-14 00:18:01
Donald Kinghorn

Yes, When code is optimized for, and well suited to, vectorization AVX512 can be a significant boost. It was many years before Intel released MKL for general (free) use, but when they did, many software developers started linking to it by default. (Like Matlab, Anaconda Python ...) This causes some trouble for AMD CPU's not so much because of AVX512 vs AVX2 but because MKL detects "non genuine Intel" and then sets a code path that is not even optimized for AVX2.

They did disable the DEBUG work-around, like I thought they would when I wrote this post, but they have improved MKL performance for AMD somewhat. I'll be doing serious testing with this stuff again early in the new year. The new AMD processors are really good! ... and what's coming up looks even better.

For now at least, it's going to be "easier" to get good performance on Intel for some programs, but that could change.

Posted on 2020-11-14 01:25:36
babiloe

The funny thing about AVX2 and AVX512 is that they lower the CPU frequency. Unless the computer is a dedicated compute workstation running AVX2/AVX512 code, this can harm other services, for example on a server, since the frequency drops by roughly 10%.

https://blog.cloudflare.com...

Now I'm trying to use the same trick on Windows with an AppInit DLL, which works like the same preload, but I guess it needs test mode with driver signing and Secure Boot turned off.

Posted on 2020-11-20 18:36:23
Donald Kinghorn

That clock lowering was my biggest disappointment with AVX512 when it was first added. I was expecting a massive performance increase like there was from AVX to AVX2. Still, it's pretty nice when you're running well vectorized code (and nothing else).

I'm thinking the days of CPU vector unit improvements are over. It just makes more sense to offload that kind of a workload to an accelerator of some sort. ... curious to see what AMD does with Xilinx

Posted on 2020-11-20 20:11:06