Read this article at https://www.pugetsystems.com/guides/1560
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy - MKL vs OpenBLAS

Written on August 20, 2019 by Dr Donald Kinghorn

Introduction (and a bit of history!)

In this post I've done more testing with Ryzen 3900X looking at the effect of BLAS libraries on a simple but computationally demanding problem with Python numpy. The results may surprise you! I start with a little bit of history of Intel vs AMD performance to give you what may be a new perspective on the issue.

Another reality check with the AMD Zen2 Ryzen 3900X! Two equally important ingredients are needed to do anything useful with a computer: the hardware and the software.

The hardware and software, together with helpful tools like specialized performance and feature libraries and community support, make up what is referred to as an "ecosystem" when they are made available to developers and users in an organized, cohesive, professional and easily usable manner.

The best example (ever!), of a well done hardware/software ecosystem is what NVIDIA has done with GPU accelerated computing.

On the CPU side of computing, Intel has a very good and, lately, rapidly expanding ecosystem for the x86 CPU architecture. Intel learned some lessons from the glory days of the proprietary UNIX hardware vendors. The UNIX hardware vendors all had their own proprietary compilers and numerical libraries, i.e. BLAS libraries. When the UNIX hardware/software goliath DEC (Digital Equipment Corporation) shut down, some of their talented compiler developers went to Intel. (Important VMS OS devs went to Microsoft and created Windows NT.) Intel developed a "best in class" compiler suite and the highly optimized compute library collection MKL (Math Kernel Library).

What about AMD? AMD is the only other major company to have a license to the x86 architecture. Indeed, AMD is responsible for many advancements to the architecture, like the 64-bit x86_64 extension, which was originally known as AMD64! (amd64 is still used as the architecture name for 64-bit x86 Linux binary packages.) So, do the Intel compilers and MKL work with AMD CPUs? Yes, kind of...

When Intel first released their (fantastic) compilers and MKL, it was soon discovered that they worked on AMD processors but didn't give very good performance (even though the core architecture was the same). Someone got the idea to spoof the processor ID (hard to do) on an AMD Opteron so that it responded as "Genuine Intel". Sure enough, the performance went way up! When a program's calls to MKL start up, the first thing that happens is a check for "Genuine Intel" and the available compute-core features. The calls then execute a "code-path" with the best optimizations for those core features. But if the ID is AMD, then the code-path chosen is an old SSE optimization path, i.e. no modern performance optimizations. Intel has every right to do that! It IS their stuff, and there IS some incompatibility at the highest (or lowest) levels of optimization for the hardware. And MKL is insanely well optimized for Intel CPUs ... as it should be!
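
The dispatch behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic as this post describes it, not MKL's actual implementation; the function names and the path labels are hypothetical. "GenuineIntel" and "AuthenticAMD" are the vendor strings that Intel and AMD x86 CPUs report.

```python
# Illustrative sketch only: mimics the vendor-gated dispatch described above.
# choose_code_path, feature_only_path and the returned labels are hypothetical.

def choose_code_path(vendor_id, features):
    """Pick an optimization path from a CPU vendor string and feature flags."""
    if vendor_id != "GenuineIntel":
        # Non-Intel CPUs get the conservative baseline regardless of features.
        return "sse-baseline"
    if "avx512f" in features:
        return "avx512"
    if "avx2" in features:
        return "avx2"
    return "sse-baseline"


def feature_only_path(features):
    """The same choice made from CPU features alone, ignoring the vendor."""
    return choose_code_path("GenuineIntel", features)


print(choose_code_path("GenuineIntel", {"sse2", "avx2", "avx512f"}))  # avx512
print(choose_code_path("AuthenticAMD", {"sse2", "avx2"}))             # sse-baseline
print(feature_only_path({"sse2", "avx2"}))                            # avx2
```

The second and third calls describe the same AVX2-capable CPU; only the vendor check changes the outcome, which is what the processor ID spoofing trick exploited.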

Long story short, there was a lawsuit and a settlement. Nothing much changed except that Intel has to include an "Optimization notice" on their publications.

The bottom line is that Intel is a massive company compared to AMD, and its resources for developing an "ecosystem" are much greater. AMD is doing good work on their ecosystem. There is an optimizing compiler project, AOCC, and AMD's BLIS (BLAS) performance library.

To get the best out of AMD hardware you often have to do a little extra work. It's good to be aware of that! The new AMD hardware is looking really good and they have gotten a very important contract for what will be the US's first ExaScale Supercomputer. That DOE contract will provide a large amount of funding and highly skilled developers to expand and optimize the "ecosystem" for AMD. It will definitely be getting better ...soon!

Test systems: AMD Ryzen 3900X and Intel Xeon 2175W

AMD Hardware

  • AMD Ryzen 3900X 12-core AVX2
  • Motherboard Gigabyte X570 AORUS ULTRA
  • Memory 4x DDR4-3200 16GB (64GB total)
  • 2TB Intel 660p NVMe M.2
  • NVIDIA 2080Ti GPU

Intel Hardware

  • Intel Xeon-W 2175 14-core AVX512
  • ASUS C422 Pro SE (my personal workstation)
  • 128GB DDR4 2400 MHz Reg ECC memory
  • Samsung 960 EVO 1TB NVMe M.2
  • NVIDIA Titan V GPU

Software

  • Ubuntu 18.04
  • Anaconda Python build Anaconda3-2019.07-Linux-x86_64
  • numpy 1.16.4 (default env)
  • mkl 2019.4 (default env)
  • libopenblas 0.3.6 (in my "openblas-np" env)

I will describe how to create an env with numpy linked to OpenBLAS in the section after the results.

Notes:

  • OpenBLAS is an excellent open source BLAS library based on the highly regarded work originally done by Kazushige Goto.
  • OpenBLAS does not currently have optimizations for AVX512. (It does include AVX2 optimizations.)
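
On Linux, a quick way to see which of these instruction sets a CPU reports is to read the "flags" line of /proc/cpuinfo. The following is a minimal sketch, assuming a Linux system; parse_cpu_flags and simd_level are hypothetical helper names, and avx2 and avx512f are the flag names the Linux kernel uses for AVX2 and the AVX512 foundation instructions.

```python
# Minimal sketch, assuming Linux. Helper names are hypothetical.

def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def simd_level(flags):
    """Name the widest of the two vector extensions discussed in this post."""
    if "avx512f" in flags:
        return "AVX512"
    if "avx2" in flags:
        return "AVX2"
    return "no AVX2"


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(simd_level(parse_cpu_flags(f.read())))
    except OSError:
        print("/proc/cpuinfo not available (not a Linux system?)")
```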

Now onto some simple testing that will illustrate the consequences of the history discussed in the introduction.

Ryzen 3900X and Xeon 2175W performance using MKL and OpenBLAS for a Python numpy "norm of matrix product" calculation

numpy is the most commonly used numerical computing package in Python. The calculation presented in this testing is very simple but computationally intensive. It will take advantage of the BLAS library that gives numpy its great performance. In this case we will use Anaconda Python with "envs" set up for numpy linked with Intel MKL (the default) and with OpenBLAS (described in the next section).


[Chart: numpy, Ryzen 3900X vs Xeon 2175W, MKL vs OpenBLAS]

Those are pretty dramatic differences! The standout features are,

  • MKL provides tremendous performance optimization on Intel CPUs. The test job is definitely benefiting from AVX512 optimizations, which are not available in this OpenBLAS version.
  • OpenBLAS levels the performance difference considerably by providing good optimization up to the level of AVX2. (Keep in mind that the 2175W has 14 cores vs 12 cores on the Ryzen 3900X.)
  • The low-optimization code-path that MKL uses for AMD CPUs is devastating to performance.

This test clearly shows the effect of hardware-specific code optimization. It is also pretty synthetic! In the real world, programs are more complicated and are usually not anywhere near fully optimized, especially with regard to vectorization that takes advantage of AVX. There are also common numerical libraries that are not so heavily targeted to specific architectures, for example the popular, and very good, C++ Boost library suite.

Note: I also tried to set up a numpy linked with the AMD BLIS lib, but it did not work correctly (very poor performance). I did not troubleshoot the issues.

Creating an "env" with conda that includes OpenBLAS for numpy

What I did to get numpy with different BLAS lib links in Anaconda python was simple.

Create and activate an env for the OpenBLAS linked numpy,

conda create --name openblas-np
conda activate openblas-np

Then install numpy specifying the BLAS library,

conda install numpy jupyter ipykernel "blas=*=openblas"

Then I created a kernel for Jupyter notebook and started a notebook using that kernel,

python -m ipykernel install --user --name openblas-np

Following are the Jupyter notebook input cells for the test,


import numpy as np
import time
n = 20000
A = np.random.randn(n,n).astype('float64')
B = np.random.randn(n,n).astype('float64')
start_time = time.time()
nrm = np.linalg.norm(A@B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ",nrm)
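
To put a timing from this test in context, it helps to estimate the work and memory involved. A dense n x n matrix product takes roughly 2n^3 floating point operations, and each float64 matrix needs 8n^2 bytes. The sketch below does that arithmetic. The helper names are hypothetical, the 10 second timing is a made-up input rather than a measured result, and the GFLOPS estimate ignores the comparatively small cost of the norm.

```python
# Back-of-envelope arithmetic for the test above. Helper names are hypothetical.

def matmul_flops(n):
    """Approximate floating point operations for an n x n by n x n product."""
    return 2 * n**3


def matrix_gib(n, bytes_per_element=8):
    """Memory for one n x n matrix in GiB (float64 is 8 bytes per element)."""
    return n * n * bytes_per_element / 2**30


def gflops(n, seconds):
    """Sustained GFLOPS implied by a wall-clock time for the product."""
    return matmul_flops(n) / seconds / 1e9


n = 20000
print("FLOPs for A@B: {:.2e}".format(matmul_flops(n)))        # 1.60e+13
print("one matrix:    {:.2f} GiB".format(matrix_gib(n)))      # 2.98 GiB
print("A, B and A@B:  {:.2f} GiB".format(3 * matrix_gib(n)))  # 8.94 GiB
# Hypothetical timing, for illustration only:
print("10 s would be  {:.0f} GFLOPS".format(gflops(n, 10.0)))  # 1600 GFLOPS
```

With n = 20000 the three matrices alone need about 9 GiB, so a smaller n is a safer starting point on machines with less memory.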

Here's a command and the output that shows that the numpy configuration is indeed using OpenBLAS,

np.__config__.show()

    blas_mkl_info:
      NOT AVAILABLE
    blis_info:
      NOT AVAILABLE
    openblas_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_mkl_info:
      NOT AVAILABLE
    openblas_lapack_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
    lapack_opt_info:
        libraries = ['openblas', 'openblas']
        library_dirs = ['/home/kinghorn/anaconda3/envs/openblas-np/lib']
        language = c
        define_macros = [('HAVE_CBLAS', None)]
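
The same check can be done programmatically by pulling the library names out of that text. This is a minimal sketch that assumes the "libraries = [...]" layout shown above; other numpy versions may print a different layout, in which case the function simply returns an empty set. linked_libraries is a hypothetical helper name, and the sample string is an excerpt of the output above.

```python
import ast

# Minimal sketch: assumes the "libraries = [...]" text layout shown above.
# linked_libraries is a hypothetical helper name.

def linked_libraries(config_text):
    """Collect library names from 'libraries = [...]' lines of the config text."""
    libs = set()
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "libraries":
            libs.update(ast.literal_eval(value.strip()))
    return libs


sample = """
blas_mkl_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
"""

print(sorted(linked_libraries(sample)))  # ['openblas']
```

On a live system, the same function can be fed the captured output of np.__config__.show(), for example by wrapping the call in contextlib.redirect_stdout.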

Conclusions

When I started this testing I expected to see results like the ones I found; however, I was still surprised when I saw them! I hope this post has given you a new perspective on the AMD vs Intel thing. I have great respect for both of these companies and I know we are going to see good things from both of them going forward.

The AMD Ryzen Zen2 processors are impressive and seem to be an excellent value. We are working diligently on validating full platforms. We are not rushing this process. There are still rough edges on the total system package that we want to get right. I am looking forward to getting my hands on the next Threadripper and hopefully will get a chance to fire up an Epyc Rome system too.

Happy computing! --dbk @dbkinghorn



Tags: Ryzen, Python, Scientific Computing, AMD, numpy, BLAS
Misha Engel

This is how business works.
Intel Xeon W-2175 16.1 second vs. AMD Ryzen 9 3900x 39.9 seconds.

The Intel is 248% faster at a 390% higher price.
Intel $1.947 vs. AMD $499

With other software like Blender, Davinci Resolve, Cinema4D the 3900x will be around 30% faster.
AMD spends money on better(opensource) compilers, Intel spends money to block AMD, nothing new.

Posted on 2019-08-21 01:14:31
lemans24

Really depends on what you are using the hardware for as it is only a one time cost. If you are dependent on calculations that drive your income then the 3175x could pay for itself very quickly...

Posted on 2019-08-26 17:45:22
lemans24

sorry meant 2175W

Posted on 2019-08-26 17:46:45
Misha Engel

Or update OpenBLAS, it will give you free performance.

Posted on 2019-08-27 14:36:11
lemans24

I definitely think you should optimize as much in software as possible and then buy the fastest hardware that you can afford with the best rate of return!! No such thing as free performance!!! Took me over 6 months of bare metal c/c++ programming in CUDA to get 100x increase in performance which if I was paid to do would have been into 6 figures!!! Now that I have optimal software performance, buying the fastest NVidia gpu cards is relatively cheap...

Posted on 2019-08-27 15:37:06
Jan Dorniak

Phoronix Has a nice test of how recent compilers work with AMD's Zen 2. If you recompile OpenBLAS with AOCC it might well turn out to gain a fair bit of performance. Even clang or gcc with -march=native should help although I'm not sure if the versions with zenver2 optimisation tables are already released.

In case you didn't know: a recent BIOS update for AMD chipsets fixes the systemd issue for Ubuntu 19.04 and other recent distributions.

Link to the AOCC test (if you allow links): https://www.phoronix.com/sc...

Posted on 2019-08-21 07:00:39
Kyle Vrooman

Thanks for this note. Always interested to see the extra steps you need to enable optimized performance.

Considering that the standard packages in Anaconda for Tensorflow-cpu now also build for MKL, I would assume there would be a large regression for AMD processors when the standard package changed over ? Compiling for march=native and non-mkl would seem important following from your investigation above...?

Posted on 2019-08-21 13:14:59
MagicWax

Please update your OpenBLAS version if possible. 0.3.7 is now out, and it has new optimizations for Zen2

Posted on 2019-08-21 18:28:57
Misha Engel

Since when is that in intel's interest?

Posted on 2019-08-21 20:49:36
MagicWax

Your point being...?

Posted on 2019-08-22 14:20:09
MagicWax

As an aside, it is not too hard to take the MKL library and patch it with a hexeditor, so that it runs the Intel path on AMD. There are also some undocumented build tricks that one can use to knock out the CPU vendor checks from both the Intel compilers and MKL. Agner Fog has a great guide on how to do it, and unlike binary patching, you can distribute the resulting binary!

Posted on 2019-08-21 18:38:36