Read this article at https://www.pugetsystems.com/guides/1637
Dr Donald Kinghorn (Scientific Computing Advisor )

How To Use MKL with AMD Ryzen and Threadripper CPU's (Effectively) for Python Numpy (And Other Applications)

Written on November 27, 2019 by Dr Donald Kinghorn

Introduction

In this post I'm going to show you a simple way to significantly speed up Python numpy compute performance on AMD CPU's when using Anaconda Python.

We will set a DEBUG environment variable for Intel MKL that forces it to use the AVX2 vector unit on AMD CPU's (this will work for other applications too, like MATLAB for example.) ... but please see "BIG Caveat!" at the end of this post.

You may be wondering why this is an issue. In a recent post "AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy - MKL vs OpenBLAS" I showed how to use OpenBLAS as an alternative and how bad performance was with AMD when using MKL. I also gave a bit of a history lesson explaining the long running "Optimization" issue between AMD and Intel. The short story is that Intel checks for "Genuine Intel" CPU's when its numerical library MKL starts executing code. If it finds an Intel CPU then it will follow an optimal code path for maximum performance on that hardware. If it finds an AMD processor it takes a code path that only optimizes to the old (ancient) SSE2 instruction level, i.e. it doesn't take advantage of the performance features on AMD and the performance will be several times slower than it "needs" to be.

The following paragraph is one of the most regretted things I've written ...

Maybe you're thinking that it's not "fair" for Intel to do that, but ... Intel has every right to do that! It IS their stuff. They worked hard and utilized a lot of resources to develop it. And, there IS some incompatibility at the highest (or lowest) levels of optimization for the hardware. MKL is insanely well optimized for Intel CPU's ... as it should be!

I honestly don't feel that way! This came up years ago when Intel first started marketing their compilers. I was outraged then and really, I still am. I think I was trying to "justify" it because I couldn't change it then and I can't change it now. The only thing I can do now is show people how to get around it! For everyone that takes offense to that paragraph, I understand and I regret saying it. Collectively we should continue the fight against that sort of corporate tactic. --dbk

Read the post listed above if you are interested in this old and ongoing issue.

In the next sections we'll look at performance results from a simple numpy matrix algebra problem. There will be results from the post that was linked above along with new results using the 24-core AMD Threadripper 3960x.

Test systems: AMD Threadripper 3960x, Ryzen 3900X and Intel Xeon 2175W

AMD Hardware

  • AMD Threadripper 3960x 24-core AVX2
  • Motherboard Gigabyte TRX40 AORUS EXTREME
  • Memory 8x DDR4-2933 16GB (128GB total)
  • 1TB Samsung 960 EVO NVMe M.2
  • NVIDIA RTX 2080Ti GPU
  • AMD Ryzen 3900X 12-core AVX2
  • Motherboard Gigabyte X570 AORUS ULTRA
  • Memory 4x DDR4-3200 16GB (64GB total)
  • 2TB Intel 660p NVMe M.2
  • NVIDIA 2080Ti GPU

Intel Hardware

  • Intel Xeon-W 2175 14-core AVX512
  • ASUS C422 Pro SE (My personal workstation )
  • 128GB DDR4 2400 MHz Reg ECC memory
  • Samsung 960 EVO 1TB NVMe M.2
  • NVIDIA Titan V GPU

Software

  • Ubuntu 18.04
  • Anaconda Python build Anaconda3-2019.07-Linux-x86_64
  • numpy 1.16.4
  • mkl 2019.4
  • libopenblas 0.3.6

Notes:

  • OpenBLAS is an excellent open source BLAS library based on the highly regarded work originally done by Kazushige Goto.
  • OpenBLAS does not currently have optimizations for AVX512 (It does include AVX2 optimizations)

Using MKL_DEBUG_CPU_TYPE=5 with AMD CPU's

The environment variable above is the "new secret way" to fool MKL into using an AVX2 optimization level on AMD CPU's. This environment variable has been available for years but it is not documented. PLEASE SEE THE CAVEAT IN THE CONCLUSION!

I seem to remember this from long ago with Opteron?? In any case it has been making the rounds on forums recently as a solution for getting MATLAB to perform better on AMD CPU's (other use cases too). This should work for any application that is making calls to the MKL runtime library. I believe it is forcing MKL to take the Haswell/Broadwell code path which gives an optimization level that includes AVX2. By default, MKL looks for "Genuine Intel" and if it doesn't find that it drops to a code path only optimized to SSE2 instruction level i.e. no modern hardware optimizations.
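The dispatch behavior described above can be pictured with a small toy model. This is only a sketch of the logic as described in this post, not MKL source code: the function name, the path labels, the feature names and the idea that the value 5 maps to the AVX2 path are all assumptions made for illustration.

```python
# Toy model of the vendor-gated dispatch described in the text.
# NOT MKL code: names, labels and the meaning of "5" are assumptions.

def choose_code_path(vendor, features, debug_cpu_type=None):
    """Return a label for the code path a vendor-gated library might pick."""
    # Assumed behavior of the debug override: 5 forces the AVX2 path.
    if debug_cpu_type == 5:
        return "avx2"
    # Vendor gate: anything that is not "GenuineIntel" gets the baseline path.
    if vendor != "GenuineIntel":
        return "sse2"
    # Only Intel CPUs reach the feature-set query; pick the best level found.
    for level in ("avx512f", "avx2", "avx"):
        if level in features:
            return level
    return "sse2"


amd_features = {"avx", "avx2"}
print(choose_code_path("AuthenticAMD", amd_features))       # sse2
print(choose_code_path("AuthenticAMD", amd_features, 5))    # avx2
print(choose_code_path("GenuineIntel", {"avx", "avx2", "avx512f"}))  # avx512f
```

The point of the sketch is that the AMD CPU reports AVX2 support in both of the first two calls; only the vendor check decides which path it gets.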

On Linux you would set this environment variable in your working shell or add it to .bashrc

export MKL_DEBUG_CPU_TYPE=5

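In a Jupyter notebook cell, use the %env magic before numpy is imported (a "!export ..." line runs in a temporary subshell and does not change the environment of the running kernel),

%env MKL_DEBUG_CPU_TYPE=5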

On Windows 10 you could set this in (Anaconda) Powershell as,

$Env:MKL_DEBUG_CPU_TYPE=5 

or, you could set it in a Jupyter notebook cell with the %env magic (%env MKL_DEBUG_CPU_TYPE=5), run before numpy is imported.

You can also set this in System in Control Panel (Advanced tab or the Advanced System Settings item).

[Screenshot: System control panel, environment variables dialog]
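The variable can also be set from inside a Python script or notebook, which works the same way on Linux and Windows. The sketch below is a minimal example under one assumption: that MKL reads the variable when the library is first loaded, so it has to be set before numpy is imported. The helper name is hypothetical.

```python
import os
import sys
import warnings


def request_mkl_avx2_path(value="5"):
    """Set MKL_DEBUG_CPU_TYPE for the current process (hypothetical helper)."""
    if "numpy" in sys.modules:
        # Assumption: MKL reads the variable at load time, so this is
        # probably too late to have any effect in this process.
        warnings.warn("numpy is already imported; set the variable earlier")
    os.environ["MKL_DEBUG_CPU_TYPE"] = value
    return os.environ["MKL_DEBUG_CPU_TYPE"]


print(request_mkl_avx2_path())
# import numpy as np   # import numpy only after the variable is set
```

Child processes started from this script inherit the setting, but other shells and already running kernels do not.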

Threadripper 3960x, Ryzen 3900X and Xeon 2175W performance using MKL, MKL_DEBUG_CPU_TYPE=5 and OpenBLAS for a Python numpy "norm of matrix product" calculation

numpy is the most commonly used numerical computing package in Python. The calculation presented in this testing is very simple but computationally intensive. It will take advantage of the BLAS library that gives numpy its great performance. In this case we will use Anaconda Python with "envs" set up for numpy linked with Intel MKL (the default) and with OpenBLAS (described in the next section).
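For readers who want to try a calculation of this kind, here is a minimal sketch of a "norm of matrix product" timing. It is not the script used for the results in this post (that code is in the older post referenced below); the function name, the seed and the matrix size are assumptions, and the size is kept small so it finishes quickly.

```python
import time

import numpy as np


def norm_of_product(n, seed=0):
    """Time norm(A @ B) for two random n x n float64 matrices."""
    rng = np.random.RandomState(seed)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    start = time.perf_counter()
    nrm = np.linalg.norm(A @ B)   # matrix product (BLAS) + Frobenius norm
    elapsed = time.perf_counter() - start
    return elapsed, nrm


elapsed, nrm = norm_of_product(1000)
print(f"took {elapsed:.3f} seconds")
print(f"norm = {nrm}")
```

A handy sanity check is that for standard normal entries the norm should come out close to n**1.5, whichever BLAS library is doing the work. Only the elapsed time should differ noticeably between MKL, MKL with the debug variable, and OpenBLAS.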


[Chart: numpy Ryzen 3900X vs Xeon 2175W, MKL vs OpenBLAS]

Look at those results and think about it for a while ... The standout features are,

  • The best result in the chart is for the TR 3960x using MKL with the environment var MKL_DEBUG_CPU_TYPE=5. AND it is significantly better than the low optimization code path from MKL alone. AND, OpenBLAS does nearly as well as MKL with MKL_DEBUG_CPU_TYPE=5 set.
  • MKL provides tremendous performance optimization on Intel CPU's. The test job is definitely benefiting from AVX512 optimizations which are not available in this OpenBLAS version.
  • OpenBLAS levels the performance difference considerably by providing good optimization up to the level of AVX2. (keep in mind that the 2175W is 14-core vs 12-cores on the Ryzen 3900X and 24 cores on the TR 3960x)
  • The low optimization code-path used for AMD CPU's by MKL is devastating to performance.

This test clearly shows the effect of hardware specific code optimization. It is also pretty synthetic! In the real world programs are more complicated and are usually not anywhere near fully optimized, especially with regard to vectorization that takes advantage of AVX. There are also common numerical libraries that are not so heavily targeted to specific architectures, for example the popular, and very good, C++ boost library suite.

Creating an "env" with conda that includes OpenBLAS for numpy

Please see this older post "AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy - MKL vs OpenBLAS" for information on how to use OpenBLAS with Anaconda Python and the Python code that was used for this testing.
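Once an env is set up, it is worth confirming which BLAS library numpy is actually linked against before running any timings. The sketch below captures the text printed by numpy's show_config() and looks for library names in it. The helper name is hypothetical, and the string match is only a rough heuristic because the layout of that report differs between numpy versions.

```python
import contextlib
import io

import numpy as np


def blas_report():
    """Return the text that np.show_config() prints (hypothetical helper)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


report = blas_report()
for name in ("mkl", "openblas", "blis"):
    # Heuristic only: a name can appear for reasons other than linkage.
    print(name, "mentioned:", name in report.lower())
```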

Conclusion (BIG Caveat!)

I have to reiterate, MKL_DEBUG_CPU_TYPE is an undocumented environment variable. That means that Intel can remove it at any time without warning. And, they have every right to do that! It is obviously intended for internal debugging, not for running with better performance on AMD hardware. It is also possible that the resulting code path has some precision loss or other problems on AMD hardware. I have not tested for that!
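One simple way to look for precision problems is to compare a small matrix product from the BLAS library against a slow reference computed in pure Python. The following is a minimal sketch of that idea, not a validated test suite; the function name, the matrix size and the use of math.fsum for the reference are all choices made here for illustration.

```python
import math

import numpy as np


def max_relative_error(n=64, seed=0):
    """Compare A @ B from BLAS against a pure-Python fsum reference."""
    rng = np.random.RandomState(seed)
    A = rng.standard_normal((n, n))
    B = rng.standard_normal((n, n))
    fast = A @ B                      # goes through the linked BLAS
    rows = A.tolist()
    cols = B.T.tolist()
    # Reference: each entry is a carefully summed dot product.
    ref = np.array([[math.fsum(a * b for a, b in zip(r, c)) for c in cols]
                    for r in rows])
    return float(np.abs(fast - ref).max() / np.abs(ref).max())


print(f"max relative error: {max_relative_error():.3e}")
```

Run it once with and once without MKL_DEBUG_CPU_TYPE=5 set. Differences at the level of ordinary float64 rounding are expected; anything much larger would deserve a closer look.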

The best solution for running numerically intensive code on AMD CPU's is to try working with AMD's BLIS library if you can. Version 2.0 of BLIS gave very good performance in my recent testing on the new 3rd gen Threadripper. For the numpy testing above it would be great to be able to use the BLIS v2.0 library with Anaconda Python the same way that I used OpenBLAS. Someone just needs to set up the conda package with the proper hooks to set it as the default BLAS. I don't have the time or expertise to do this myself, so if you can do it then please do, and let me know about it!

Happy computing! --dbk @dbkinghorn



Tags: Ryzen, Python, Scientific Computing, AMD, numpy, BLAS, Threadripper
Methylzero

Openblas 0.3.7 now has some AVX512 optimization, and further optimizations are present in trunk, so when 0.3.8 comes around it will have quite good AVX512 performance on at least some of the BLAS functions.

Posted on 2019-11-28 11:51:29
Donald Kinghorn

Thanks for adding your comment! People should know that OpenBLAS is more up-to-date than what is pulled down by conda as I did in this post. It would be good to see that updated ... I'd also really like to see a BLIS setup

Posted on 2019-11-30 00:02:03
tim3lord

You should make a post about installing OpenBLAS with numpy and pytorch. I can't seem to figure it out on my AMD Threadripper machine!

Posted on 2019-11-30 00:13:19
Donald Kinghorn

If you check out the post "AMD Ryzen 3900X vs Intel Xeon 2175W Python numpy - MKL vs OpenBLAS" you'll see a way to do it with Anaconda Python. It's basically changing the default BLAS runtime lib in the environment. I haven't checked this with PyTorch yet.

In that post I added OpenBLAS when I setup the Jupyter kernel but you can (and maybe should) do it when you create the env,

conda create --name openblas-np numpy "blas=*=openblas"

You are right, I should write up a better post detailing how that works ... and check it out for PyTorch (when I first worked on this stuff I was going to do PyTorch but just did numpy for the write up ...)

Posted on 2019-11-30 00:33:06
tim3lord

Thanks for all of the work you put into your articles. Very helpful!

Posted on 2019-11-30 00:42:20
lhl

I don't know anything about AMD BLIS, but the default BLIS package is fairly easy to install in Anaconda:

conda create -c conda-forge -n numpy-blis numpy "blas=*=blis"

And based on Zen1 benchmarks, it might significantly outperform OpenBLAS: https://github.com/flame/bl...

Posted on 2019-12-08 07:38:59
Donald Kinghorn

... I put a longer reply at the end of your results ... I liked your idea of dropping the newer dynamic obj archive on top of the .6 versions ...

Posted on 2019-12-09 16:03:42
Donald Kinghorn

... I just noticed that OpenBLAS 0.3.7 is in conda-forge when I was replying to tim3lord below ... I feel another post coming on :-)

Posted on 2019-11-30 00:35:20
Donald Kinghorn

yup here it is,
kinghorn@u18tr:~$ conda create --name openblas.3.7 -c conda-forge blas=*=openblas
Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

environment location: /home/kinghorn/miniconda3/envs/openblas.3.7

added / updated specs:
- blas[build=openblas]

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
blas-2.14                  |         openblas          10 KB  conda-forge
libblas-3.8.0              |      14_openblas          10 KB  conda-forge
libcblas-3.8.0             |      14_openblas          10 KB  conda-forge
libgcc-ng-9.2.0            |       hdf63c60_0         8.6 MB  conda-forge
liblapack-3.8.0            |      14_openblas          10 KB  conda-forge
liblapacke-3.8.0           |      14_openblas          10 KB  conda-forge
libopenblas-0.3.7          |       h6e990d7_3         7.6 MB  conda-forge
------------------------------------------------------------
                                       Total:        16.3 MB

Posted on 2019-11-30 01:39:58
Donald Kinghorn

Here is a quick test on Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (I have access to this for a few days)

MKL default
kinghorn@u18tr:~$ conda activate nptest
(nptest) kinghorn@u18tr:~$ python nptest.py
took 10.723653316497803 seconds
norm = 2828339.6004819106

OpenBLAS 0.3.7 from -c conda-forge
(openblas.3.7) kinghorn@u18tr:~$ python nptest.py
took 38.21160125732422 seconds
norm = 2828248.9869486988

OpenBLAS 0.3.6 from -c anaconda
kinghorn@u18tr:~$ conda activate np-openblas
(np-openblas) kinghorn@u18tr:~$ python nptest.py
took 24.212520122528076 seconds
norm = 2828509.2474168343

The conda-forge package doesn't seem to work as well in this case for some reason??? Note: this is the same test code that I ran in the post. You can see that the Core-X 18-core does a little bit better with MKL (as expected) than the TR 3960x ... MKL is highly optimized for Core and Xeon for this kind of problem, it's mostly DGEMM going to the AVX512 vector unit...

for completeness here's NAMD STMV on this CPU
Info: Benchmark time: 18 CPUs 0.379879 s/step 4.39675 days/ns 4820.6 MB memory
The Ryzen 3950x 16-core does this job faster at 4.2 days/ns !!

ooops, does a bit better with hyperthreading in use,
Info: Benchmark time: 36 CPUs 0.323866 s/step 3.74845 days/ns 5569.66 MB memory

Posted on 2019-11-30 02:15:00
Methylzero

Ooops, I think I was wrong about AVX-512 in 0.3.7, that version just disabled more AVX-512 code, due to at the time unresolved bugs causing wrong results. Trunk should now have the fixed asm kernels and whatnot and AVX-512 enabled.

Posted on 2019-12-03 13:47:37
Ned Flanders

I understand you included a word of caution for the usage of the MKL in AVX2 mode on AMD CPUs. But this is nowhere near a new tweak. It never made it out of the HPC department but has been used there for a long time already. It's not dangerous to use it and AMD CPUs do not give incorrect results using this tweak. Yet OpenBLAS, if available, is the lib of choice. Fully agree with that, and I sincerely hope that Matlab will include a choice between MKL and OpenBLAS in future releases. Bundling software with the MKL without offering alternatives is wrong.

Posted on 2019-11-29 21:52:37
Donald Kinghorn

Hey Ned :-) Yes, I do agree with you, but I did decide to put a caution in there (while I was writing the post) just in case there is something strange with how it responds with Zen2. I really didn't do the depth of testing I would do in a critical production environment ... but yes, it's highly unlikely that there are any problems.

I think it is great that you and others have exposed this old hack ... I had completely forgotten about it! I haven't really seriously used AMD stuff since I was building Opteron clusters "back in the day" ... This is a wonderfully simple way to get proper performance on AMD hardware for some of these important programs that linked to MKL for BLAS and Lapack calls.

My biggest fear is that, with this spreading around the net, Intel will pull the plug ... because they could ... (or at least make it more difficult)

I am pretty impressed with BLIS v2.0 and OpenBLAS is really very good (I used Goto's libs years ago). With AMD back in the game I hope that devs and ISV's will start offering alternatives to a default link to MKL

I'm also really impressed with the new Ryzen and Threadripper processors! I think the 64-core TR will be tremendous for codes like NAMD, really looking forward to testing that.

Posted on 2019-11-30 00:04:46
Misha Engel

Maybe you can get your hands on a EPYC 7H12, it's also a 280 Watt part with 64c/128tr, when pcie 3 is fast enough you can use an old board.

Posted on 2019-11-30 00:43:43
Donald Kinghorn

would seriously love that :-)

Posted on 2019-12-02 22:05:10

See, that's exactly why I published it. "Awareness!". My feeling was that many simply implemented the MKL and when they or their customers realized that AMD was slow... well... it was slow... it's AMD.
The way this MKL workaround spread across the net like wildfire only confirms this. Now at least, everyone knows what the problem is and I sincerely hope that a vendor string discriminating piece of software will not be implemented anymore without alternatives. This has always been wrong but now, people know at least. So in case Intel pulls the plug, companies like Mathworks would face even more pressure to implement alternatives. To not support the fastest CPUs (Threadripper and 3950x) on the market is not a good idea. And as you say.... alternatives are there. OpenBLAS is actually very good, and gets better with every release. Again, thanks for picking this up and providing some benchmarks here!

Posted on 2019-11-30 19:04:29
Flanders

Donald, I just found this first AMD Epyc sys in the Top500. https://www.top500.org/syst... Check out the "Software" Section. I strongly believe, that they must have tested the debug AVX2 mode thoroughly. Actually, I strongly believe that Intel must officially allow the usage of the debug mode. In two weeks or so, Matlab 2020a will be released and I heard "rumors" that they will implement this mode into the official production release (by the way, you should test that!). I really hope that SymPy, NumPy the Conda people along with MSR will follow the same course. I think that makes your "Big Caveat" section a pretty tiny one.

Posted on 2020-03-05 14:41:57
Donald Kinghorn

Ha, cool! Looks like they used Intel compilers and MPI too :-) Yes, I hope that "Big Caveat" will just go away. I don't completely trust that everyone will play nice, but we can hope ...

I would love to test Matlab. I used it a lot when I was an academic. In fact, I used Cleve Moler's Fortran code that became MatLab when I was an undergrad. I've been trying to talk MathWorks out of a license for years :-)

OpenBLAS is looking pretty good on Threadripper. When I did testing on the 3990x running numpy norm(A@B), I tried numpy with MKL(debug) and OpenBLAS and OpenBLAS gave better performance ... that's not what I saw when I tested the Zen2 Ryzen's

More news, ... we are going to be qualifying EPYC at Puget. We'll focus mostly on single socket and a couple of high end duals. I was a little disappointed in the scaling to 64 cores on the 3990x with some of my tests (overall performance was good though). I think that EPYC may do better because of memory performance.

Posted on 2020-03-06 03:16:00

Matlab offers a free Demo version for 4 weeks ;-) Not long but long enough for testing. And great news regarding the EPYCs at Puget! If I was still US based, I would certainly have done business with PS meanwhile. Most approachable and experienced system builders out there. Keep up the good work!

Posted on 2020-03-06 20:36:11
mockingbird
Intel has every right to do that! It IS their stuff.

The FTC said otherwise.

https://www.ftc.gov/sites/d...

"IT IS FURTHER ORDERED that Respondent shall not make any engineering or design change to a Relevant Product if that change (1) degrades the performance of a Relevant Product sold by a competitor of Respondent and (2) does not provide an actual benefit to the Relevant Product sold by Respondent, including without limitation any improvement in performance, operation, cost, manufacturability, reliability, compatibility, or ability to operate or enhance the operation of another product; provided, however, that any degradation of the performance of a competing product shall not itself be deemed to be a benefit to the Relevant Product sold by Respondent. Respondent shall have the burden of demonstrating that any engineering or design change at issue complies with Section V. of this Order."

Posted on 2019-12-02 13:56:50
Neo Morpheus

Doesn't matter, Puget is in Intel's pockets. No wonder that this whole site always seems to recommend Intel systems, regardless of Intel CPUs being more expensive and slower than AMD ones.

Posted on 2019-12-02 17:27:56

That would be a shame for this site. I would hope that benchmark data dictates what CPU is better rather than some kind of Intel bias.

Posted on 2019-12-02 18:01:24

I'm not going to comment on the whole FTC ruling (it is 23 pages and I'm not even going to pretend I understand all of it), but saying we are in Intel's pockets is not at all accurate. We are an Intel partner, but we are also an AMD partner - not to mention NVIDIA, Samsung, etc. If you read through our recent articles ( https://www.pugetsystems.co... ), you will see that we show that with the latest Ryzen and Threadripper CPUs, AMD comes out on top of Intel outside of a few isolated cases where Intel keeps a slim lead at certain price points. I'm not sure where people get the idea that we are paid off by Intel - we have always been about getting our customers the fastest and most reliable product for their workflow. We make the same margins whether it is AMD or Intel, so why would we ever offer a sub-par product to our customers?

Now, it is true that if AMD and Intel are very close in terms of price and performance that we will lean towards Intel for our customers. A lot of that is simply the fact that we have a TON of experience with the Z390 and X299 platforms since Intel has been pretty dominant in the markets we cater to for so long. We know their quirks and how to mitigate them, and have solid engineering contacts with both Intel and the motherboard manufacturers when issues come up. We don't have that with AMD quite yet, and that is something that can only be gained over time. Thunderbolt support that we know works is also a big factor since we have a significant number of customers who are moving from Mac to PC that need it.

As we get more sales and experience with AMD Ryzen and get Threadripper qualified and start selling it as well, we may start shifting more of our systems where Intel and AMD are neck-and-neck over to AMD, but it is going to depend on whether or not any issues come up and how severe those issues are. Our customers are overwhelmingly not tinkerers or even all that interested in computer technology (which I suspect most of our article readers are), and they are more than willing to sacrifice a bit of performance in order to guarantee stability. It is the same reason we don't do overclocking - a bit more performance in exchange for even the chance of a few more crashes a month/year/whatever simply isn't a good exchange for our customers.

Posted on 2019-12-02 18:05:50

Thanks for clarifying. I did hope that you were not in Intel's pocket as suggested by Neo Morpheus even if you lean towards Intel if neck and neck on benchmarks.

...they are more than willing to sacrifice a bit of performance in order to guarantee stability


Does security fall under stability? I find it quite odd that people are willing to risk Intel when they have now-published unfixable security vulnerabilities just waiting to be written into some consumer targeted hack tool. It makes me nervous to run Intel, which I currently do.

PS. Your Link has a ")" in it causing 404

Posted on 2019-12-02 18:17:59

Security I think is different than stability. This is getting into my personal opinion here, but a lot of the security concerns recently are not as big of a deal as some people make it out to be. Many of them are definitely a problem for servers, but for a workstation they require such a specific set of circumstances for them to actually be a problem. I'm relying on what I've been told from people way more informed about this stuff than I am, but if someone needs physical access to the machine in order to take advantage of the flaws that isn't too big of a deal IMO. If someone already has physical access, you are in trouble no matter what.

Thanks for the mention of the link error - I got that fixed.

Posted on 2019-12-02 18:51:23
libastral

Also these fixes degrade performance, mostly on Intel, since AMD architecture is far less vulnerable by design. Something to keep in mind when choosing a pricey workstation that's supposed to last many years.

Posted on 2019-12-02 21:20:01
Larry

I'll take Matt's word over you guys any day since he's the one who works with these systems on a daily basis.

Posted on 2019-12-03 19:38:53
Donald Kinghorn

This really is not true (Matt lays it out well) ... the fact is that AMD is back in the game in a serious way with the new Zen 2 core CPU's. I hope we will be able to look at EPYC too ... I'm pretty blown away with the new Threadripper and really looking forward to trying the 64-core! We will do extensive testing and make recommendations as appropriate ...

You know, there is some great hardware coming out, and innovation is happening again. It's an exciting time to be involved with it!

Posted on 2019-12-02 22:04:05
Donald Kinghorn

It's a shame that AMD "settled" on that. It seems that the only real thing Intel "had" to do was put that "Optimization Notice" on their docs! But it doesn't matter. This is now and we need to move on. AMD is on a roll. Their new BLIS v2.0 lib is looking good and OpenBLAS is also excellent. I'm planning on trying some optimized code builds linked with BLIS to see what performance we can get. AMD and the community need to focus on building "ecosystem".

Posted on 2019-12-02 22:11:11
Larry

Thing is.. the AVX/AVX2 on AMD CPUs aren't changed. There's no performance loss with them.

It's the software library and Intel has all the rights to do whatever they want with that software.

Posted on 2019-12-03 19:54:02
La Frite David Sauce Ketchup

"Ford Cars get up to 20% more MPG when running on Ford Gas*.

*By intentionally not shifting into the OD gear when non-Ford Gas is detected."

Good example of what you are saying about Intel having the right to destroy AMD performance in software.
Imagine a company like Intel spending money to de-optimize AMD CPUs.

big lol

Posted on 2019-12-02 17:48:38
Kaptein Sabeltann

ok boomer

Posted on 2019-12-02 17:50:47
Donald Kinghorn

ha ha yup I've been doing HPC since it was a thing ... really ... was one of the first folks to build a Linux cluster for computational chemistry. ... and yes we used AMD Opteron for that

Posted on 2019-12-02 21:58:08
Kaptein Sabeltann

Be careful there, or intel is going to cut your pay.

Posted on 2019-12-07 04:18:00
Donald Kinghorn

Maybe AMD will make up the difference (they're getting a lot of love from me recently) ... I got to admit I'm pretty fond of ARM too but no one believes me when I tell them that's the platform of the future

Posted on 2019-12-09 15:45:11
Hifihedgehog

I am disappointed by the level of incompetence and misunderstanding this industry professional demonstrates about the creation of libraries and the anti-trust violations involved here. I suppose he is unfamiliar with Microsoft and the decades of examples of precedent that has been established that prohibit this kind of behavior. AMD's Ryzen processors fully support AVX so Intel manually restricting its use in an industry-wide ubiquitous library is like forcing the opposing team to wear weighted shoes and combat packs at a sports match at your home stadium. You can claim that your fans are the ones mostly attending here and your team invested in the stadium as a reason for forcing the other team to play under different conditions but such action is still a clear violation of league regulations.

Posted on 2019-12-02 18:35:07
libastral

Yeah, I'm astounded that so many people fail to realize the issue here is deliberate competitor sabotage, not just "Intel not optimizing for AMD". I feel that this reddit comment thread summarizes the issue perfectly.

Posted on 2019-12-02 21:18:38

Except that the Reddit post takes a comment from this blog, by a single employee here at Puget Systems, and portrays it as the company's stance... and then also ignores the fact that it was said in the intro to a post about *sharing with the wider community how to get around Intel's dumb choice*. Dr Kinghorn was trying to be helpful, and spread word of a work-around that massively improves performance on AMD with this software, and he is getting lambasted over one paragraph that doesn't line up with what some (many? most? does it matter?) readers believe. "Biting the hand that feeds" seems like an appropriate metaphor in this case :/

I am not a lawyer, but it seems to me that there is a difference between the "right" to do something and whether or not it is a good / smart / moral decision. If someone writes software that straight-up does not run on a certain CPU (or brand of CPUs)... I suppose that is their right? It's a stupid decision, in my opinion, and I would consider it to be morally wrong, but that doesn't mean that they are legally prohibited from it. Maybe there is something more in the rulings that have been passed down regarding this stuff specific to Intel (again, I am not a lawyer) that specifies that they are not legally allowed to do stuff like this, but if not then it seems to me correct to say that they have the "right" to do it... but also that it is a really dumb thing to do, and bad for the tech community as a whole. Please note, however, this is *my opinion* and does not necessarily reflect the views or opinions of Puget Systems as a whole :)

Posted on 2019-12-02 21:42:23
Hifihedgehog

There is no denying he had good intentions when he wrote his piece. Intel shilling wasn’t my beef here. I am just highly disappointed by his “Intel has every right” line. That is miscategorizing anti-trust violations as something up to moral debate and not a legal issue, an assertion Puget’s other staff also made. This is completely inaccurate and is a stark reflection of his and Puget’s misunderstanding of the legal ramifications at play here. Intel may have billions of dollars and 10,000’s of employees behind this, but they still cannot exclude a competitor’s product from AVX optimization. There are already constructs in place to programmatically verify proper AVX support in a processor at a lower level than just the name of the CPU manufacturer. This would be akin to blocking 64-bit support for Intel products on AMD libraries just because AMD could not independently verify compatibility even though Intel processors with the AMD64 flag enabled should produce a 100% identical result in conforming to the AMD64 instruction set.

Posted on 2019-12-03 00:10:44

That is a fair criticism, and as far as I know none of us here at Puget are lawyers or particularly up to date on legal stuff like that :)

Posted on 2019-12-03 00:16:05
Donald Kinghorn

That line kind of blew up on me :-) Your take is right on! This kind of thing has been going on for so long that maybe I've become numb to it when in the past I would have been outraged. I can tell you, I was outraged when it first happened years ago. I couldn't believe that Intel got away with it and that AMD "settled". At that point I just used PGI compilers and Goto's library (which has become OpenBLAS). Performance was good with that.
Maybe it's a good thing that people are riled up again about this kind of thing. There is lots of new hardware coming up including really innovative stuff. We need to make sure that good things don't get squashed by the "big guys".

Posted on 2019-12-03 00:59:37
gparmar76

The CPUs are all x86 and AMD pays a license for it..Intel deliberately crippled AMD here and that much is clear. It has nothing to do with "owning" the software.. perhaps now would be a good time to read up on anti competitive laws.

Posted on 2019-12-03 03:14:04
Methylzero

And all Intel CPUs are using the AMD64 extensions. AMD and Intel agreed to license a lot of their patents to each other for compatibility reasons.

Posted on 2019-12-04 14:18:11
Donald Kinghorn

Yes, it really is sad. I'm a big fan of open source (open everything) and sharing. The "issue" with MKL and the Intel compilers in general was a big disappointment. When Intel launched their compilers they were looking really good, so everyone wanted to use them, but AMD was doing some great CPU's at that time and then this whole stupid "disable optimizations" thing came up. The people I worked with and I used PGI compilers and did so for many years.

Then AMD kind of just stopped and folks shifted to Intel compilers and MKL. Once Intel opened the MKL dev libs for free use, a lot of ISVs started linking to it by default since most high performance work was being done on Intel. Now that AMD has killer CPU's again, this whole thing has come to light again.

I really hope that AMD will be able to get the resources to keep up the work on BLIS and keep their momentum going on the hardware.

Posted on 2019-12-02 21:55:51
Behrouz Sedigh

AVX, AVX2, SSE2, SSE3, SSE4: those are free extensions (based on an agreement between Intel and AMD) that AMD can use on their CPUs, so Intel can't make a compiler by writing this code:

If Cpu = i5/i7/i9
then use AVX2
otherwise
use SSE2

This is illegal. BUT:

If Cpu = i5/i7/i9
then use Non-Free Specific extension
otherwise
use Standard extension

Or

If Cpu = i5/i7/i9
then use Hack Method = Go to B then C then A then D
otherwise
use Standard Method = Go to A Then B then C Then D

Or

If Cpu = i5/i7/i9
then AVX2
otherwise
Get Error
// this means only available on specific CPU

This is legal. This is called "OPTIMIZATION".

Posted on 2019-12-02 19:54:41
Ned Flanders

I can tell you how exactly it works, because I looked into the mkl.dll code.

The mkl.dll makes a vendor string query (not "If CPU = i5" etc.). It specifically checks the vendor string. If this is not "GenuineIntel" (you can actually see the expected answer in the code), then it jumps to SSE2. If the answer is "GenuineIntel", the code jumps to a feature set query. Depending on the reported feature set, it decides on the codepath.

In fact, you can modify the mkl.dll and replace "GenuineIntel" by "AuthenticAMD". Such a patched mkl.dll accepts this as the correct answer and chooses the correct codepath on AMD CPUs accordingly. This patched mkl.dll on an Intel CPU only runs SSE2.

So... it is not optimization. In fact, Intel invested resources to code a vendor string query which does nothing to benefit its own CPUs and is nothing but a performance kill switch for AMD CPUs.
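To make the described behavior concrete, here is a minimal Python sketch of that two-step dispatch. It is only a model of what the comment above describes, not MKL's actual code: the function name `choose_codepath`, the `trusted_vendor` parameter (standing in for the string patched inside the DLL) and the order of the feature levels are all assumptions made for illustration.

```python
def choose_codepath(vendor, features, trusted_vendor="GenuineIntel"):
    """Hypothetical model of the dispatch described above (not MKL source).

    vendor         -- CPU vendor string, e.g. "GenuineIntel" or "AuthenticAMD"
    features       -- set of feature names the CPU reports
    trusted_vendor -- the string the library compares against
    """
    # Step 1: vendor string check. Anything else falls straight to SSE2,
    # regardless of what the CPU can actually do.
    if vendor != trusted_vendor:
        return "sse2"
    # Step 2: feature set query, best level first (assumed ordering).
    for level in ("avx512f", "avx2", "avx"):
        if level in features:
            return level
    return "sse2"


zen2 = {"sse2", "avx", "avx2"}

print(choose_codepath("AuthenticAMD", zen2))  # unpatched library on AMD
print(choose_codepath("AuthenticAMD", zen2, trusted_vendor="AuthenticAMD"))  # patched library on AMD
print(choose_codepath("GenuineIntel", zen2, trusted_vendor="AuthenticAMD"))  # patched library on Intel
```

Note that the feature set never changes between the three calls; only the string comparison decides whether it gets looked at.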

Hope that helped to clarify the situation.

Posted on 2019-12-18 18:27:55
Donald Kinghorn

love it, thanks for posting that

Posted on 2019-12-18 22:42:02
Ned Flanders

You can even find that in a hex viewer
https://abload.de/img/mklhe...

Posted on 2019-12-19 12:48:51
Leandro Ferrero
Maybe you're thinking that it's not "fair" for Intel to do that, but ... Intel has every right to do that! It IS their stuff. They worked hard utilized a lot of resources to develope it. And, there IS some incompatibility at the highest (or lowest) levels of optimization for the hardware. MKL is insanely well optimized for Intel CPU's ... as it should be!

I have bought a lot of Intel hardware over the last decade. But Intel has no right to hurt performance for end users who don't buy hardware from them. It's not ethical, and it may be illegal.

They have the right to develop new optimizations, and when the compiler finds that those optimizations are officially supported by the CPU, it should use them.

If every company does that... maybe you can receive 80v instead of 110v because you are connected to the electrical network of another company. "They have every right to do that, because they spend money and resources on generating that electricity"... that logic is absurd.

That is not fair competition at all. Imagine Apple throttling the internet connection of Google apps like Google Chrome on iOS. Is that fair? Do they have every right to do that too?

They have to be stopped. That's wrong, period.

Posted on 2019-12-03 00:58:41
Donald Kinghorn

Sorry, you are right. I've added a note around that paragraph ... It's not right and I should not have tried to justify it because I honestly think it was the worst sort of corporate tactic!

Posted on 2019-12-03 02:02:27
Leandro Ferrero

I respect that.

Posted on 2019-12-03 02:38:59
yv

I know I'm beating a dead horse and the thread is old, but was that a fair statement? If I understand correctly, MKL is a sort of marketing tool for Intel. They invested a substantial amount of resources into development and maintenance of the library. They do not charge for the library. Is it fair to require Intel to market AMD processors as well? How is that different from robbery? The end users of MKL on AMD do not compensate Intel for its work. Is this stealing?

The post above is full of bright but irrelevant analogies.

By the way, patching the library file as described in Ned Flanders' post above is a violation of the DMCA as far as I know; setting the environment variable must be legal anywhere in the world, IMO.

Posted on 2020-03-04 02:08:44
Donald Kinghorn

This whole thing goes way back. First, in the UNIX world it was common for vendors to have an optimized BLAS/Lapack/DFT package for their hardware. They were usually included with the OS on a machine. When DEC bit it, Intel got most of their compiler writers (MS got the VMS team that created Windows NT). Intel developed their compilers together with MKL as a commercial package and sold licenses for years. When AMD started clobbering them with Opteron for cluster sales they fought back (ruthlessly). It got ugly in the early 2000's and there were lawsuits ... It's only recently (a few years) that Intel has made MKL freely available as a redistributable bundle. This led to software vendors using and including it in their distributions. You used to have to recompile code yourself to link to MKL, or do some config change (if you had a commercial license to use it). And, software vendors would have to maintain separate code branches if they wanted to use statically linked builds. A lot of people used Portland Group compilers because of this ...

The fact that MKL is now "free" and linked by default with some important packages kind-of changes things! It's a mess and a shame really.

I'll add that in my recent testing on the new Threadrippers I'm seeing OpenBLAS outperform MKL! That's great and how it should be ... but this still leaves the problem of ISVs doing default links to MKL. I really don't know what to do about it!?? The ISVs are in a bit of a bind because they want to have optimal performance with their code ... that usually means MKL ... but now that AMD is back stronger than ever ... well, I think we are in a transition period. My hope is that OpenBLAS (and other open libs) will just improve enough that it is the best choice for ALL platforms and ISVs will use it by default!

Posted on 2020-03-04 16:30:52
lhl

Since I got curious, here are some test results on my Ryzen 3700X system (benchmarked using this little script by Markus Beuckelmann that I can't seem to link, but it should be easy to find by searching that name and numpy BLAS if anyone wants to replicate). My results:

# blis 0.6.0 h516909a_0
Dotted two 4096x4096 matrices in 2.30 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.94 s.
Cholesky decomposition of a 2048x2048 matrix in 0.24 s.
Eigendecomposition of a 2048x2048 matrix in 6.36 s.

# AMD BLIS 2.0
Dotted two 4096x4096 matrices in 2.33 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.78 s.
Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
Eigendecomposition of a 2048x2048 matrix in 5.84 s.

# libopenblas 0.3.7 h5ec1e0e_4
Dotted two 4096x4096 matrices in 0.41 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.55 s.
Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
Eigendecomposition of a 2048x2048 matrix in 5.53 s.

# mkl 2019.4 243
Dotted two 4096x4096 matrices in 1.53 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.51 s.
Cholesky decomposition of a 2048x2048 matrix in 0.29 s.
Eigendecomposition of a 2048x2048 matrix in 4.79 s.

# export MKL_DEBUG_CPU_TYPE=5
Dotted two 4096x4096 matrices in 0.33 s.
Dotted two vectors of length 524288 in 0.02 ms.
SVD of a 2048x1024 matrix in 0.29 s.
Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
Eigendecomposition of a 2048x2048 matrix in 3.29 s.

I set up AMD BLIS 2.0 by simply setting up regular BLIS and extracting the binaries into that conda environment's lib/include, and it seemed to work hunky dory (it was a little faster, but not better than OpenBLAS or even the gimped MKL in these tests).
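For anyone who wants the ratios rather than the raw times, here is a small sketch that computes the speedup of the MKL_DEBUG_CPU_TYPE=5 run over the default MKL run and over OpenBLAS. The timings are copied from the results above (seconds, Ryzen 3700X); the dictionary keys and the `speedups` helper are just names picked for this example. The vector dot product is left out because its times are reported in ms at a resolution too coarse to compare.

```python
# Timings in seconds, copied from the results posted above.
mkl_default = {
    "matmul 4096x4096": 1.53,
    "SVD 2048x1024": 0.51,
    "Cholesky 2048x2048": 0.29,
    "eigendecomposition 2048x2048": 4.79,
}
mkl_debug5 = {
    "matmul 4096x4096": 0.33,
    "SVD 2048x1024": 0.29,
    "Cholesky 2048x2048": 0.15,
    "eigendecomposition 2048x2048": 3.29,
}
openblas = {
    "matmul 4096x4096": 0.41,
    "SVD 2048x1024": 0.55,
    "Cholesky 2048x2048": 0.14,
    "eigendecomposition 2048x2048": 5.53,
}


def speedups(baseline, candidate):
    """Return baseline_time / candidate_time per test (>1 means candidate is faster)."""
    return {name: baseline[name] / candidate[name] for name in baseline}


for label, baseline in (("default MKL", mkl_default), ("OpenBLAS", openblas)):
    print(f"MKL_DEBUG_CPU_TYPE=5 vs {label}:")
    for name, ratio in speedups(baseline, mkl_debug5).items():
        print(f"  {name}: {ratio:.2f}x")
```

A ratio below 1 in the OpenBLAS comparison means OpenBLAS was the faster of the two for that test.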

Posted on 2019-12-08 22:35:36
Donald Kinghorn

thanks for doing some testing! Interesting way to bump the BLIS version :-) that's like something I would try ...
I did try linking numpy in anaconda with the blis 0.6 package but got disappointing results like you saw

I will be working on this over the next several weeks ... Thinking ahead to the 64-core TR and updated chipset

One of the first things I'll put up as a post is how to setup a docker environment with compiler and lib support for Zen 2 core app development/optimization.

I got really good results with the BLIS linked HPL Linpack which makes me think we should be seeing much better results for things like numpy ...???

Posted on 2019-12-09 15:58:44
Alexander Pivovarov

Thank you so much for this article (and very good name chosen for it too)! I just finished a brand new build with Ryzen 3950x and was a bit disappointed with the performance improvements. The MKL environment variable trick gave me quite a bit of additional boost.
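For reference, the environment variable trick can also be applied per process from Python instead of with a shell `export`. The sketch below is one way to do that under stated assumptions: the helper names `env_with_mkl_override` and `run_with_override` are made up for this example, and MKL_DEBUG_CPU_TYPE is an undocumented debug setting that may not be honored by every MKL release. The variable has to be in the environment before MKL is loaded, which is why the sketch launches a child process with it already set.

```python
import os
import subprocess
import sys


def env_with_mkl_override(base=None):
    """Return a copy of the environment with MKL_DEBUG_CPU_TYPE=5 set.

    Hypothetical helper; the value 5 is the one used in the article.
    """
    env = dict(os.environ if base is None else base)
    env["MKL_DEBUG_CPU_TYPE"] = "5"
    return env


def run_with_override(code):
    """Run a Python snippet in a child process that starts with the override set."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env_with_mkl_override(),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# The child sees the variable from its very first instruction, so any
# library it imports afterwards (numpy linked to MKL, for example) does too.
print(run_with_override("import os; print(os.environ['MKL_DEBUG_CPU_TYPE'])"))
```

Inside a single script, the equivalent would be assigning `os.environ["MKL_DEBUG_CPU_TYPE"] = "5"` at the very top, before the first `import numpy`.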

Posted on 2019-12-12 04:54:43