
https://www.pugetsystems.com

Read this article at https://www.pugetsystems.com/guides/1303
Dr Donald Kinghorn (Scientific Computing Advisor)

AMD Threadripper 2990WX 32-core vs Intel Xeon-W 2175 14-core - Linpack NAMD and Kernel Build Time

Written on December 6, 2018 by Dr Donald Kinghorn


When I wrote my recent AMD Threadripper 2990WX HPL Linpack "How-To", most of the time I had with the processor went into getting that benchmark to work. However, I did run a few other test jobs that I thought the 2990WX would do well with, and I compared the results against my personal workstation with a Xeon-W 2175. In this post I share those test runs with you. It's not thorough testing by any means, but it was interesting, and I was surprised a couple of times by the results.

The post How to Run an Optimized HPL Linpack Benchmark on AMD Ryzen Threadripper -- 2990WX 32-core Performance is a good read for a high performance compute perspective on Threadripper.

Note: My results for the NAMD job runs include runs with an NVIDIA Titan V. NAMD has good GPU acceleration but needs a lot of CPU performance to balance it.


Test systems: AMD 2990WX and Intel Xeon-W 2175

The AMD Threadripper system I used was a test-bed build with the following main components:

AMD Hardware

  • AMD Ryzen Threadripper 2990WX 32-Core @ 3.00GHz (4.2GHz Turbo)
  • Gigabyte X399 AORUS XTREME-CF Motherboard
  • 128GB DDR4 2666 MHz memory
  • Samsung 970 PRO 512GB M.2 SSD
  • NVIDIA Titan V GPU

The Intel system is my personal workstation

Intel Hardware

  • Puget Systems Peak Single
  • Intel Xeon-W 2175 14-core @ 2.5GHz (4.3GHz turbo)
  • ASUS C422 Pro SE (my system board; the Peak Single uses the very nice ASUS WS C422 SAGE/10G)
  • 128GB DDR4 2400 MHz Reg ECC memory
  • Samsung 960 EVO 1TB NVMe M.2
  • NVIDIA Titan V GPU

Software


Linpack

I'll start with the Linpack benchmark that I went to great pains to optimize and compile for AMD Threadripper. For the Intel benchmark I'll just use the OpenMP-threaded binary included with Intel MKL.

linpack chart

Linpack is my favorite benchmark for CPU performance because it exposes near-maximum double precision floating point performance of the processor. It is a good measure of how well optimized (vectorized) numerical linear algebra and matrix/vector algorithms will perform. That is the basis for the majority of scientific high performance computing software.

The Intel "X-series" and Xeon -W or -SP are much faster for this kind of intense compute workload. Here's why:

  • AMD Threadripper 2990WX -- 32-cores, each core has 1 AVX2 (256bit) vector unit == 597 GFLOPS
  • Intel Xeon-W 2175 -- 14-cores, each core has 2 AVX512 (512bit) vector units == 838 GFLOPS

The Intel AVX512 vector units provide great performance for well optimized code ... but keep in mind not all code is well optimized (or at least not well vectorized).
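As a rough sanity check on those numbers, theoretical peak double precision performance can be estimated as cores x clock x FLOPs per cycle. The following is a minimal sketch, not part of the original testing. It assumes the base clocks from the hardware lists, the vector-unit counts exactly as given in the list above, and one fused multiply-add (2 FLOPs) per 64-bit lane per cycle. The function name `peak_gflops` is hypothetical, and the 597 and 838 GFLOPS figures are taken from the list above as the Linpack results. Sustained clocks under heavy AVX load differ from the base clock, so treat the output as a ballpark only.

```python
def peak_gflops(cores, clock_ghz, vector_bits, fma_units_per_core):
    """Estimate theoretical peak double precision GFLOPS.

    Assumes each vector unit retires one fused multiply-add per
    64-bit lane per cycle (2 FLOPs per lane).
    """
    doubles_per_vector = vector_bits // 64
    flops_per_cycle = fma_units_per_core * doubles_per_vector * 2
    return cores * clock_ghz * flops_per_cycle


# Base clocks and unit counts as listed in the article (assumed here to be
# the relevant clocks; real AVX clocks under load will differ).
cpus = {
    "Threadripper 2990WX": {"cores": 32, "clock_ghz": 3.0, "vector_bits": 256,
                            "fma_units_per_core": 1, "linpack_gflops": 597},
    "Xeon-W 2175": {"cores": 14, "clock_ghz": 2.5, "vector_bits": 512,
                    "fma_units_per_core": 2, "linpack_gflops": 838},
}

for name, spec in cpus.items():
    peak = peak_gflops(spec["cores"], spec["clock_ghz"],
                       spec["vector_bits"], spec["fma_units_per_core"])
    efficiency = spec["linpack_gflops"] / peak
    print(f"{name}: peak ~{peak:.0f} GFLOPS, "
          f"Linpack at {efficiency:.0%} of that estimate")
```

Under these assumptions the estimates come out to 768 GFLOPS for the 2990WX and 1120 GFLOPS for the Xeon-W 2175, so both Linpack results land at roughly 75-78% of the base-clock estimate. The gap between the two processors comes mostly from vector width and unit count rather than from core count.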

NAMD

NAMD is a molecular dynamics program. It has very good parallel performance and scales well on systems ranging from multi-core workstations to the largest supercomputers. It also has very good GPU acceleration. GPUs greatly improve performance, but NAMD has a significant portion of code that needs to run on the CPU. It is important to get a balance between CPU and GPU for best hardware utilization.

NAMD ran really well on the 2990WX!

namd chart

Note that the Threadripper 2990WX was nearly twice as fast as the Xeon-W 2175 for the CPU-only runs. With the addition of the Titan V, the performance of the two processors was much closer. This may indicate that the 2990WX would do better with more GPUs. I don't know that for sure because I didn't test with multiple GPUs. Balance is important with NAMD, and you need a lot of CPU performance to keep up with newer NVIDIA GPUs. It would be interesting to test the 2990WX with 2 x 2080Ti's or perhaps 4 x 2070's. If there is interest I may see if I can do that testing.

Linux kernel build time

I ran this benchmark from the Phoronix test suite. I wanted to see the non-floating point performance of the 2990WX, and compiling a large code base is an important application. Compiling a software package consisting of a large number of source files can often be done with a lot of parallelism. I expected that the Threadripper 2990WX with its 32 cores would be a good processor for this and, indeed, it did quite well.

kernel chart


That's all of the performance testing results I have for the Threadripper 2990WX. I still feel that the Intel X-series and Xeon -W (and -SP) are "generally" better processors for HPC workloads. But, note that I said generally! I'm thinking about code that I would compile and optimize myself! I would make an effort to get good vectorization and for that the Intel AVX512 vector units are great.

How do you know if the programs you want to run will do well on the high core count AMD Threadripper processors?

Here are a few considerations:

  • Try to find trusted performance evaluations for the software you are going to use (with similar types of job runs). This can be tricky because, despite Google's great power, some stuff just doesn't get posted online (or it's just very poor quality). You could try asking for advice or recommendations in user or developer forums. You could also try bugging me to do it, but you might not succeed with that.

  • Understand the parallel scaling performance of your code. Try running it on your current system with anywhere from 1 to the maximum number of cores you have. If it doesn't scale close to linearly, then having a large number of cores is likely not going to help. Keep in mind that scaling may fall off rapidly beyond 4-8 cores. Think Amdahl's Law!

  • If you know that your code or job run workflow scales nearly perfectly because it's "embarrassingly parallel", or you just have lots of jobs to run simultaneously, then a high core count processor may be great. Keep in mind that you could run into memory contention, space limitations, or I/O limitations.

  • If you can, try to see whether your program has good vectorization. You may be able to turn off AVX in the BIOS. If you are building from source, try compiling with noavx (for example -mno-avx with GCC, or similar for your compiler). If you don't see much performance change with AVX disabled, then those great AVX512 vector units on the Intel processors won't be doing you much good. If you cannot fix that, then maybe you are best off just considering lots of cores.
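The scaling check in the second consideration can be made concrete with Amdahl's Law. The following is a minimal sketch: the function names are hypothetical and the wall-clock timings are made-up illustration values, not measurements from this post. It estimates the parallel fraction of a job from two timed runs and then projects the speedup at higher core counts, ignoring memory and I/O contention.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup on `cores` cores for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def parallel_fraction_from_timing(t_one_core, t_n_cores, cores):
    """Invert Amdahl's Law to estimate the parallel fraction from two timings."""
    speedup = t_one_core / t_n_cores
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


# Hypothetical wall-clock times in seconds (illustration only).
t1, t8 = 100.0, 20.0
p = parallel_fraction_from_timing(t1, t8, cores=8)
print(f"estimated parallel fraction: {p:.3f}")

for n in (8, 14, 32):
    print(f"{n:2d} cores: projected speedup {amdahl_speedup(p, n):.2f}x")
```

With these made-up timings the job is about 91% parallel, and going from 8 to 32 cores only raises the projected speedup from 5x to about 8.75x. That is the kind of fall-off to look for before deciding that a high core count processor is the right choice.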

I enjoyed the Threadripper 2990WX. Having 32 cores to play with makes you think differently. It's like a "single node cluster" ... kind of ... really it felt a lot like a quad-socket system that would have cost $25000 a few years ago!

Happy computing --dbk

Tags: Threadripper, Ryzen, 2990WX, Xeon-W, Linpack, NAMD, HPC, Linux
NOS

Good job!

I am eager to see what the Linpack test result for Intel Xeon-W 3175X will look like. My wild guess is that it will be ~1.4 TFLOPS

Intel is positioning W-3175X as the Threadripper 2990WX competitor on the HEDT.

I am surprised to see a 14-core (2.5GHZ) CPU decimate a 32-core (3GHZ) CPU in this test!!!

Posted on 2018-12-09 02:53:16
Donald Kinghorn

I was surprised too ... in both directions ... I guess I wasn't too surprised to see the Xeon-W clobber it on Linpack. All of the Skylake-X, -W, and -SP processors have AVX512, and so do the newer X-series (and all but the low-end Xeon-SP have 2 units per core). These processors also clobber Intel's other CPUs like Coffee Lake.

I was a little surprised by how well TR did with NAMD, and I expected it to do well in many-process integer stuff like compiling. If I can, I'm going to do some more testing with NAMD using multiple GPUs ... looking for the performance "sweet-spot"

Posted on 2018-12-10 17:10:49
bernard gingold

Threadripper 2990WX has 2 FP MUL units working on 128-bit vector registers and 2 FP ADD units operating on 128-bit vector registers. L1 cache bandwidth is half of the bandwidth available on the Skylake u-arch.
In general, for highly optimized and easily vectorized code, a 2-FMA 512-bit Skylake CPU will easily outperform the AMD ZEN u-arch, even though the PMU on higher-end Xeons (i.e. Gold SKU) will throttle the reference clock speed by as much as 20%-30%.
In terms of raw computational speed (theoretical limit), 2 FMA 512-bit units can process 32 double precision floating-point values versus 8 double precision floating-point values (ZEN).

Posted on 2018-12-18 11:54:58
TA Nie

Very interesting test! I did not know each Xeon had 2x AVX512 per core! I've read how Intel's AVX512 is better optimized; however, I would not expect 32 AVX512 would be beaten so greatly by 28 Intel AVX512 cores.

AMD has some work to do here, but if they get those optimizations done the Threadripper CPU would be clearly overall better... for far cheaper.

Posted on 2019-01-14 17:58:29
Donald Kinghorn

AVX2 is very good but AVX512 has more operations and twice the vector bit width. There is a catch though, when AVX512 units are under load they are on their own clock and the core clock lowers in order to maintain power load. The overall effect is that instead of potentially twice the performance it is more like 40-50% improvement ... which is really not bad!

There is no reason that I know of that AMD would be prohibited from implementing AVX512. I would expect to see it in their next gen procs. They might do something better even??? ARM even has vector units for some of the newer compute oriented chips. They are potentially serious compute competitors along with AMD!

The number of AVX512 units is almost a hidden spec. The lower level Purley Xeons are the only "skylake core" processors that I know of for sure with only 1 unit/core. It is probably a "yield" thing. Rather than toss them out they lower the overall spec and disable stuff that doesn't work.

Posted on 2019-01-14 19:47:20
Nicholas Johnson

I do not believe it is possible to compile the Linux kernel in 35 seconds....
Are you sure you did "make clean" before you ran it?
I have a 6-core / 12-thread i7-4930K @ 4GHz system here and just timed compile of Linux 5.0-rc4 at 18m 58.067s - so with 5-6x the cores, shouldn't it be more like 3 minutes? This was -j64.

Posted on 2019-01-31 09:03:18
Donald Kinghorn

yup it is pretty amazing, in the-old-days you used to be able to go have lunch! ... That is the kernel build benchmark from the Phoronix benchmark suite. I used Michael's docker image for the test suite. He has 32 seconds on his testing for the 2990WX https://www.phoronix.com/sc... You could install the suite and try it on your system to get a better comparison.

The page for this benchmark is https://openbenchmarking.or... I looked at the revision history. Looks like it is downloading a 4.18 kernel, and yup, there is a make clean in there ...

I don't use Michael's stuff very often but it is pretty good. I wanted to try his docker image, it was really convenient for me since I always setup docker on test systems.

Cheers Mate!

Posted on 2019-02-01 01:40:20