Read this article at https://www.pugetsystems.com/guides/1370
Dr Donald Kinghorn (Scientific Computing Advisor)

Intel Xeon W-3175X and i9 9990XE Linpack and NAMD on Ubuntu 18.04

Written on February 28, 2019 by Dr Donald Kinghorn

Introduction

I was able to spend a little time with the Intel Xeon W-3175X and the Core i9 9990XE processors, and I ran a couple of numerical compute performance tests with the Intel MKL Linpack benchmark and NAMD. I used the same system image that I had recently used to look at 3 Intel 8-core processors, so I will include those results here as well. There are results for the W-3175, 9990XE, 9800X, W-2145, and 9900K.

Intel has been doing some strange things recently. They are short on supply for many of their processors, so we are seeing some models that would not normally appear in their line. For example, some of the processors with integrated GPUs that come out of the fab with faulty GPU sections, but perfectly good CPU sections, are being released with the GPU disabled.

There are 2 recent Intel processors that are really strange: the Xeon W-3175X 28-core and the overclocked Core i9 9990XE 14-core. I don't know for sure if these processors are the result of fab "problems" or not. The W-3175X could be an overclocked Xeon Scalable 8180 that had problems rendering it useless in a multi-socket system, and the 9990XE could be a 9980XE 18-core that had a few bad cores that, when disabled, allowed for overclocking of the remaining 14 cores. That's pure speculation based only on my wild imagination!

Neither of these processors is actually available other than through an odd auction process, in batches, to OEMs. The 9990XE does not have a warranty and I cannot find it on "Intel Ark". The W-3175X is listed on Intel Ark. To me, the i9 9990XE does not appear to be a real product, and I don't understand why Intel would sell it to anyone without a warranty. They are not supporting it in any tangible way. The W-3175X at least "looks" like it might be a real product, but with no promise of availability or predictable pricing.

The W-3175X requires a special motherboard and cooler, and surprisingly both ASUS and Gigabyte have made boards available. They are extremely large boards that use the Xeon Scalable C621 chipset.

We did get these two oddballs in for testing at Puget Systems. My colleagues Matt and William did a fair amount of testing with popular software programs running on Windows 10. You can find several of their posts listed in the Puget Systems "Articles" section. This one is particularly good for the 9990XE: "Intels Core Xperiment i9 9990XE". I very much agree with the sentiment in that post that the 9990XE is just an experiment, not a viable product. In practical terms, the W-3175X is not a viable product either.

Caveats aside, I was of course curious about the raw double precision floating point performance of these monsters, so I popped into Puget labs, fired up Ubuntu 18.04 on them, and ran the Intel optimized Linpack benchmark from MKL. That is, in my opinion, the best relative measure of numerical compute performance for Intel processors.


Processor Specs i9 9900K, i7 9800X, Xeon W-2145, i9 9990XE, Xeon W-3175

The following table lists some of the specification differences between these processors that are relevant when configuring a numerical computing workstation.

i9 9900K, i7 9800X, Xeon W-2145, i9 9990XE, Xeon W-3175 Features

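| Features        | i9 9900K    | i7 9800X  | Xeon W-2145      | i9 9990XE | Xeon W-3175      |
|-----------------|-------------|-----------|------------------|-----------|------------------|
| Code Name       | Coffee Lake | Skylake-X | Skylake-W        | Skylake-X | Skylake-W        |
| Cores           | 8           | 8         | 8                | 14        | 28               |
| Base Clock      | 3.6GHz      | 3.8GHz    | 3.7GHz           | 4.0GHz    | 3.1GHz           |
| Max Turbo       | 5.0GHz      | 4.5GHz    | 4.5GHz           | 5.1GHz*   | 3.8GHz*          |
| All Core        | 4.7GHz      | 4.1GHz    | 4.3GHz           | 5.0GHz*   | 3.7GHz*          |
| Cache           | 16 MB       | 16.5 MB   | 11 MB            | 19.25 MB  | 38.5 MB          |
| TDP             | 95 W        | 165 W     | 140 W            | 255 W     | 255 W            |
| Max Mem         | 64 GB       | 128 GB    | 512 GB (Reg ECC) | 128 GB    | 512 GB (Reg ECC) |
| Mem Channels    | 2           | 4         | 4                | 4         | 6                |
| Max PCIe lanes  | 16          | 44        | 48               | 44        | 48               |
| X16 GPU support | 1           | 2         | 3 (4 w/PLX)      | 2         | 3 (4 w/PLX)*     |
| Vector Unit     | AVX2        | AVX512    | AVX512           | AVX512    | AVX512           |
| Price           | $500        | $600      | $1113            | *         | $3000*           |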

Notes:

Clock Frequencies: I have included some raw frequency monitoring output in the appendix. Here is what I observed when running Linpack. For the 9990XE, the job started at 5.0GHz and stayed there on all cores during the initialization of the job. When the AVX512 units went under load, the clock on all but 2 cores dropped to 3.1GHz, while 2 cores remained near 5.0GHz. 3.1GHz is presumably the AVX512 clock frequency. For the job run on the W-3175, the initial clock was 4.3GHz, dropping to 3.7GHz all-core during initialization and then to 2.8GHz when the AVX512 load started.

PCIe: It is common for Xeon-W systems to support 2 or 3 X16 cards without a PLX switch. The motherboard we used had an X16,X8,X16,X8 layout.

Pricing: There is no official price for the 9990XE (it's not a product); see "Intels Core Xperiment i9 9990XE". The W-3175 is listed as a real product on Intel Ark, with an MSRP of approx. $3000. It also requires a special (massive!) socket 3647 motherboard, which would cost close to $2000, and a really good cooler. There are many details that make a system built around the W-3175 a non-viable product. I understand the temptation to think that you "want one of those", but really, it does not look supportable as a product.


Hardware under test:

There were 4 platforms used in this testing.

  • Intel Core i9 9900K 3.6GHz 8-Core

    • Gigabyte Z390 Designare Motherboard (1 x X16 PCIe)
    • 64 GB DDR4-2666 Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Core i9 9990XE 5.0GHz 14-Core and Core i7 9800X 3.8GHz 8-Core

    • Gigabyte X299 Designare Motherboard (2 x X16 PCIe)
    • 128GB DDR4-2666 Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Xeon W-2145 3.7GHz 8-Core

    • Asus WS C422 SAGE/10G Motherboard (4 x X16 PCIe)
    • 256GB DDR4-2666 Reg ECC Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Xeon W-3175 3.1GHz 28-Core

Software:

I had the OS and applications installed on the Intel 660p M.2 drive and swapped it between the test systems.

I am running Linux for this testing but there is no reason to expect that the same types of workloads on Windows 10 would show any significant difference in performance.


Results

Linpack

An optimized Linpack benchmark can achieve near theoretical peak performance for double precision floating point on a CPU. It is the first benchmark I run on any new CPU, and it is (still) the benchmark used to rank the Top500 supercomputers in the world. I feel it is the best performance indicator for numerical computation with maximally optimized software. The Intel optimized Linpack makes great use of the excellent MKL library. Many programs link to MKL for performance, including the very useful "numerical compute scripting" packages Anaconda Python and MathWorks MATLAB.
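To put "near theoretical peak" in context, here is a rough sketch of a peak estimate in Python. It is not part of the benchmark. It assumes each core has two AVX-512 FMA units (32 double precision flops per cycle), and it uses the AVX512 clocks seen in the frequency snapshots and the size-110016 Linpack results from the appendix. The function name `peak_gflops` is just for illustration.

```python
# Rough theoretical-peak estimate for double precision (FP64).
# ASSUMPTION: two AVX-512 FMA units per core, i.e.
# 2 units x 8 doubles per 512-bit register x 2 flops per FMA = 32 flops/cycle.
FLOPS_PER_CYCLE_AVX512 = 2 * 8 * 2


def peak_gflops(cores, avx_clock_ghz, flops_per_cycle=FLOPS_PER_CYCLE_AVX512):
    """Peak GFLOPS = cores x clock (GHz) x flops per cycle per core."""
    return cores * avx_clock_ghz * flops_per_cycle


# name: (cores, observed AVX512 clock in GHz, measured GFLOPS at size 110016)
runs = {
    "i9 9990XE": (14, 3.1, 974.7487),
    "Xeon W-3175X": (28, 2.8, 1759.4013),
}

efficiency = {}
for name, (cores, clock, measured) in runs.items():
    peak = peak_gflops(cores, clock)
    efficiency[name] = measured / peak
    print(f"{name}: estimated peak {peak:.1f} GFLOPS, "
          f"measured {measured:.1f} GFLOPS, ratio {efficiency[name]:.2f}")
```

The ratio is only as good as the clock figure fed in. The frequency readings are single snapshots, so treat the result as a ballpark rather than a precise efficiency number.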

linpack chart

This is not necessarily a good selection of comparative results, but hopefully it gives you an idea of the relative performance. These results all use the same test system image and software versions.

The double precision floating point performance of the W-3175 is very impressive, as expected.

Note: These jobs ran with "real" threads since "Hyperthreads" are not useful for this calculation.

Note: The 8-core results are with a large problem size of 75000 simultaneous equations (a 75000 x 75000 "triangular solve") and used approximately 44GB of system memory. The 9990XE and W-3175 were tested with a problem size of 110016, using approximately 94GB of system memory. Also, note that the 9900K has a disadvantage on this benchmark since it has the older AVX2 vector unit.
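The memory figures follow from the size of the coefficient matrix: an n x n array of 8-byte doubles. The short Python sketch below (the helper name `matrix_bytes` is made up for illustration) shows the arithmetic for the two problem sizes.

```python
def matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to store a dense n x n matrix of float64 values."""
    return n * n * bytes_per_element


for n in (75000, 110016):
    size = matrix_bytes(n)
    print(f"n = {n}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

For n = 110016 this comes within a few MB of the "Maximum memory requested" value of 96830363392 bytes that the benchmark prints in the appendix. The quoted "approximately" figures sit between the decimal GB and binary GiB values.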

NAMD

I also tested with the Molecular Dynamics package NAMD. NAMD scales really well across multiple cores and is not specifically optimized for Intel hardware. It is highly optimized code, and it uses the very interesting Charm++ for its parallel capabilities. NAMD is an important program and I like it for testing since it is a good example of well optimized code that scales to massive numbers of processes. It also has very good GPU acceleration that needs to be balanced by good CPU performance.

NAMD CPU

The AVX512 vector units are not that important for this code since it is designed to run well on a wide variety of hardware. Higher core counts are a big advantage for performance since NAMD has very good parallel scaling.

Note: These jobs ran with "Hyperthreads" since they help with the way NAMD uses threads. It is always worth experimenting with Hyperthreads to see if they help or not.

Note: The performance units here are "days per nano-second" of simulation time. Adding a GPU will dramatically increase the performance as will be seen in the next chart.
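Since "days per nano-second" is a time, lower is better, and a speedup is the ratio of the old time to the new one. A minimal Python sketch of the conversions follows. The two timings in it are hypothetical placeholders, not values from the charts.

```python
def ns_per_day(days_per_ns):
    """Convert NAMD's day/ns figure into simulated nanoseconds per day."""
    return 1.0 / days_per_ns


def speedup(days_per_ns_before, days_per_ns_after):
    """How many times faster the 'after' run is than the 'before' run."""
    return days_per_ns_before / days_per_ns_after


# HYPOTHETICAL timings for illustration only (not measured results)
cpu_only = 2.0   # day/ns
with_gpu = 0.25  # day/ns

print(f"CPU only: {ns_per_day(cpu_only):.2f} ns/day")
print(f"With GPU: {ns_per_day(with_gpu):.2f} ns/day")
print(f"Speedup:  {speedup(cpu_only, with_gpu):.1f}x")
```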

NAMD GPU

The first thing to notice is that the performance has increased by over a factor of 10 by including the NVIDIA RTX 2080Ti!

Conclusions and Recommendations

I have to emphasize that the 9990XE and W-3175 processors are not really viable components for supportable products. They are more enthusiast curiosities than workstation components. This is especially true for the 9990XE: it has no support of any kind from Intel, and I don't even know what they were thinking. The W-3175 is more interesting, but it is still not viable as a product because of the lack of commitment and supply, as well as the "extreme" nature of the overall system platform needed to run it. So, don't even think about it!

On the positive side, 2019 should be an interesting year for new hardware. We expect a new architecture design from Intel toward the end of the year (after a hardware security bug-fix refresh). The future platform should be a significant change from what we are using now, including new chipsets supporting PCIe v4 among other niceties. We also expect the current supply issues to be resolved. Intel also has other interesting hardware projects in the works, and we may see some results from them in new compute accelerator hardware. And that's just Intel ... AMD and ARM are looking really interesting too!

Happy computing --dbk

Appendix

9990XE raw data snippets

kinghorn@utest:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              28
On-line CPU(s) list: 0-27
Thread(s) per core:  2
Core(s) per socket:  14
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Core(TM) i9-9990XE CPU @ 4.00GHz
Stepping:            4
CPU MHz:             1200.741
CPU max MHz:         5100.0000
CPU min MHz:         1200.0000
BogoMIPS:            8000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            19712K
NUMA node0 CPU(s):   0-27
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
 dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon
  pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
   vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
     cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase
      tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx
       smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc
        cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp
         hwp_pkg_req flush_l1d
kinghorn@utest:~/projects/benchmarks/linpack$ ./runme_xeon64

Current date/time: Fri Feb  8 11:35:49 2019

CPU frequency:    4.999 GHz
Number of CPUs: 1
Number of cores: 14
Number of threads: 14

Parameters are set to:

Number of tests: 1
Number of equations to solve (problem size) : 110016
Leading dimension of array                  : 110016
Number of trials to run                     : 1    
Data alignment value (in Kbytes)            : 1    

Maximum memory requested that can be used=96830363392, at the size=110016

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
110016 110016 1      910.742    974.7487 9.762934e-09 2.885014e-02   pass

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
110016 110016 1       974.7487 974.7487

Residual checks PASSED

End of tests
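
The GFlops column can be cross-checked from the problem size and run time. The Python sketch below assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2. That the MKL benchmark uses exactly this count is an assumption, though the reported numbers are consistent with it. The function names are for illustration only.

```python
def linpack_flops(n):
    """Conventional LINPACK operation count for an n x n dense solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def gflops(n, seconds):
    """Achieved GFLOPS for a problem of size n solved in `seconds`."""
    return linpack_flops(n) / seconds / 1e9


# Size and time taken from the 9990XE run above
print(f"{gflops(110016, 910.742):.2f} GFLOPS")
```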

Start of job run, (showing active "hyperthreads")

kinghorn@utest:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
  24| 39.76| 60.24|  5009
  15|  0.64| 99.36|  5009
  14|  0.40| 99.60|  5009
  21|  0.13| 99.87|  5008
  20|  0.12| 99.88|  5010
  19|  0.12| 99.88|  5009
  16|  0.12| 99.88|  5006
  18|  0.10| 99.90|  5016
  25|  0.10| 99.90|  5008
  17|  0.07| 99.93|  5005
  26|  0.06| 99.94|  5009
  27|  0.06| 99.94|  4997
  23|  0.05| 99.95|  5017
  22|  0.05| 99.95|  5005
    |Mperf|               

Frequencies during AVX512 load,

kinghorn@utest:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  14|  1.17| 98.83|  3116
  24|  0.77| 99.23|  3104
  23|  0.21| 99.79|  3479
  19|  0.12| 99.88|  3106
  25|  0.10| 99.90|  3221
  17|  0.10| 99.90|  3097
  21|  0.09| 99.91|  3144
  26|  0.08| 99.92|  5007
  18|  0.08| 99.92|  3101
  20|  0.07| 99.93|  3106
  22|  0.06| 99.94|  3211
  15|  0.04| 99.96|  3100
  27|  0.03| 99.97|  4871
  16|  0.00|100.00|  3072
    |Mperf               
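
A snapshot like the one above is easy to summarize with a few lines of Python. This is only a sketch. The sample text is pasted from the AVX512 load output above, the 4500 MHz cutoff for "still running fast" is an arbitrary choice, and `parse_freqs` is a made-up helper name.

```python
from statistics import median

# Pasted from the "Frequencies during AVX512 load" snapshot above
SAMPLE = """\
CPU | C0   | Cx   | Freq
  14|  1.17| 98.83|  3116
  24|  0.77| 99.23|  3104
  23|  0.21| 99.79|  3479
  19|  0.12| 99.88|  3106
  25|  0.10| 99.90|  3221
  17|  0.10| 99.90|  3097
  21|  0.09| 99.91|  3144
  26|  0.08| 99.92|  5007
  18|  0.08| 99.92|  3101
  20|  0.07| 99.93|  3106
  22|  0.06| 99.94|  3211
  15|  0.04| 99.96|  3100
  27|  0.03| 99.97|  4871
  16|  0.00|100.00|  3072
    |Mperf
"""


def parse_freqs(text):
    """Return {cpu_id: MHz} from `cpupower monitor -m Mperf` style rows."""
    freqs = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Data rows have four fields and start with a numeric CPU id;
        # the header and the trailing "|Mperf" line are skipped.
        if len(fields) == 4 and fields[0].isdigit():
            freqs[int(fields[0])] = int(fields[3])
    return freqs


freqs = parse_freqs(SAMPLE)
# ASSUMPTION: 4500 MHz is an arbitrary cutoff for "not clocked down"
fast = sorted(cpu for cpu, mhz in freqs.items() if mhz > 4500)
print(f"median {median(freqs.values())} MHz, CPUs above 4500 MHz: {fast}")
```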

W-3175 data snippets

from /proc/cpuinfo

processor	: 55
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) W-3175X CPU @ 3.10GHz
stepping	: 4
microcode	: 0x2000059
cpu MHz		: 3800.392
cache size	: 39424 KB
physical id	: 0
siblings	: 56
core id		: 30
cpu cores	: 28
apicid		: 61
initial apicid	: 61
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
 fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
   tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
    xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti
     ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
      smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
       intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
        cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req pku ospke
         flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6200.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
kinghorn@utest:~/projects/benchmarks/linpack$ ./runme_xeon64

Current date/time: Fri Feb 15 05:17:29 2019

CPU frequency:    4.289 GHz
Number of CPUs: 1
Number of cores: 28
Number of threads: 28

Parameters are set to:

Number of tests: 10
Number of equations to solve (problem size) : 10000 15000 18000 20000 22000 25000 26000 27000 30000 110016
Leading dimension of array                  : 10000 15000 18008 20016 22008 25000 26000 27000 30000 110016
Number of trials to run                     : 2     2     2     2     1     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     1     1    

Maximum memory requested that can be used=96830363392, at the size=110016

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
10000  10000  4      0.456      1463.3087 1.051521e-10 3.707768e-02   pass
10000  10000  4      0.450      1482.7231 1.051521e-10 3.707768e-02   pass
15000  15000  4      1.399      1608.9285 2.253401e-10 3.549145e-02   pass
15000  15000  4      1.395      1613.5717 2.253401e-10 3.549145e-02   pass
18000  18008  4      2.431      1599.5280 2.774894e-10 3.038850e-02   pass
18000  18008  4      2.430      1600.4747 2.774894e-10 3.038850e-02   pass
20000  20016  4      3.459      1542.0288 3.665729e-10 3.244973e-02   pass
20000  20016  4      3.459      1541.9453 3.665729e-10 3.244973e-02   pass
22000  22008  4      4.408      1610.6650 4.682967e-10 3.430089e-02   pass
25000  25000  4      6.509      1600.4551 5.435008e-10 3.090695e-02   pass
26000  26000  4      7.131      1643.3722 5.904530e-10 3.104779e-02   pass
27000  27000  4      7.888      1663.8254 6.503383e-10 3.171380e-02   pass
30000  30000  1      10.731     1677.5287 8.712018e-10 3.434286e-02   pass
110016 110016 1      504.572    1759.4013 1.061083e-08 3.135573e-02   pass

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
10000  10000  4       1473.0159 1482.7231
15000  15000  4       1611.2501 1613.5717
18000  18008  4       1600.0013 1600.4747
20000  20016  4       1541.9871 1542.0288
22000  22008  4       1610.6650 1610.6650
25000  25000  4       1600.4551 1600.4551
26000  26000  4       1643.3722 1643.3722
27000  27000  4       1663.8254 1663.8254
30000  30000  1       1677.5287 1677.5287
110016 110016 1       1759.4013 1759.4013

Residual checks PASSED

End of tests
kinghorn@utest:~/projects$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  39|  0.99| 99.01|  3790
  28|  0.47| 99.53|  3789
  34|  0.14| 99.86|  3789
  41|  0.07| 99.93|  3787
  31|  0.06| 99.94|  3789
  29|  0.06| 99.94|  3786
  36|  0.05| 99.95|  3787
  52|  0.03| 99.97|  3786
  30|  0.02| 99.98|  3785
  35|  0.02| 99.98|  3780
  32|  0.02| 99.98|  3777
  51|  0.01| 99.99|  3846
  37|  0.01| 99.99|  3784
  42|  0.01| 99.99|  3781
  49|  0.01| 99.99|  3775
  33|  0.01| 99.99|  3774
  55|  0.01| 99.99|  3767
  54|  0.01| 99.99|  3703
  48|  0.00|100.00|  3756
  40|  0.00|100.00|  3701
  53|  0.00|100.00|  3640
  46|  0.00|100.00|  3607
  44|  0.00|100.00|  3577
  45|  0.00|100.00|  3524
  47|  0.00|100.00|  3515
  43|  0.00|100.00|  3510
  50|  0.00|100.00|  3500
  38|  0.00|100.00|  3448
    |Mperf               

kinghorn@utest:~/projects$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  28|  2.98| 97.02|  2792
  39|  0.80| 99.20|  2792
  34|  0.31| 99.69|  2792
  44|  0.16| 99.84|  2792
  41|  0.16| 99.84|  2791
  36|  0.11| 99.89|  2793
  29|  0.10| 99.90|  2788
  52|  0.09| 99.91|  2794
  42|  0.09| 99.91|  2785
  30|  0.07| 99.93|  2787
  45|  0.06| 99.94|  2787
  46|  0.06| 99.94|  2783
  37|  0.04| 99.96|  3264
  35|  0.04| 99.96|  2799
  50|  0.03| 99.97|  2796
  32|  0.03| 99.97|  2766
  53|  0.02| 99.98|  2824
  48|  0.02| 99.98|  2806
  51|  0.02| 99.98|  2787
  31|  0.02| 99.98|  2784
  54|  0.02| 99.98|  2777
  43|  0.02| 99.98|  2775
  55|  0.01| 99.99|  2808
  40|  0.01| 99.99|  2754
  38|  0.01| 99.99|  2751
  33|  0.01| 99.99|  2733
  49|  0.00|100.00|  2784
  47|  0.00|100.00|  2765
    |Mperf               

Tags: Intel, i9 9990XE, Xeon W-3175, i9 9900K, i7 9800X, Xeon 2145W, RTX 2080Ti, Linpack, NAMD
lemans24

Don, great info regarding the w3175x but...

I don't come to the same conclusion as you regarding this not being viable for DIY builders.
The board you tested and the chip will soon be available on newegg and amazon.
I think that Intel may be rushing the introduction but I definitely see this as a viable high-end workstation solution.
It will be at least twice as expensive as a comparable threadripper system but in my profession time really is money.
Access to 192GB ECC memory with the cpu running beyond 4.0 GHz from a large single die, which would definitely at least have great IO performance as well.

Will see by the fall as I am sure cheaper boards will be out by then and once we see a Supermicro board
then this will definitely be well supported by Intel...

On the other hand, a threadripper 3000 system sure may trump this chip by the summer!!!

Posted on 2019-03-04 12:42:23
Donald Kinghorn

you are right of course :-) The W-3175 is pretty impressive. Also, if you consider that the Xeon 8180 28-core costs over $10000 it starts to look like a bargain. The problem is Intel's iffy commitment to it and inconsistent availability. It's not a viable product for Puget Systems ...in its current state ... but ... if everything shows up on newegg then a DIY build IS a viable proposition. In your case this thing could be great!

It would not be a simple build and it would be unpleasant to have in the same room because of the fan noise on the radiator (that puppy is hot when under load). It would probably be an $8-9K project and it could induce a few headaches along the way ... but it would be a killer box! I was definitely conflicted in my feelings about it!

The other consideration you also mention: new hardware is on its way! There should be some interesting hardware coming out this year. That includes Intel and AMD. Intel is in a funny spot right now. They got caught short on processors and seem to be salvaging every piece of silicon they can. I expect that things will get a lot better for them as the year progresses.

Posted on 2019-03-05 00:43:51
lemans24

I looked at the Gigabyte c621 Aorus Extreme motherboard and it looks to be a much better design as a high end workstation motherboard as compared to the Asus w3175x motherboard you were able to use. Definitely is a gamer board in name only as this is the first gamer board that I know of that has embedded 2 SATA disk on modules!! Both of these motherboards make no sense as overclocked gaming machines but would be absolutely great as workstations as the vrm designs are so over the top that this board should be rock steady at speeds slightly above 4.0 ghz. If Intel truly makes the w3175x available as a retail chip, this would be a great upgrade from my 1950x threadripper system. Looking forward to 2nd half of the year once threadripper 3000 arrives and see how Intel will react!!

Either way, it looks like I need to run a second machine with 4 gpu cards for real time options trading running my Monte Carlo simulations as my processing is not able to keep up consistently once the market goes crazy which is exactly when I want to run my simulations...can't win but I will continue to optimize and build another GPU server anyway.

Keep up the good work especially with HPC columns as those are usually the columns that are of greatest help to me...

Posted on 2019-03-11 00:02:34
lemans24

Just another note on where Intel is going regarding the w3175x:
Another board is being released by summer which indicates intel is probably serious about bringing about w3175x (including derivatives for Cascade Lake) for high end workstations - https://www.tomshardware.co...

This motherboard really looks good as it is based on the E-ATX motherboard size, which fits many full size computer cases, has a single power supply requirement for basic overclocking and, most importantly, 4 pcie x16 slots for GPU cards and other accelerators. It only supports 6 memory dimm slots though, but I don't think that is a major drawback. I/O options include 10g networking (hopefully dual 10G nics like Intel 550), dual U2 and dual 110mm M.2. I also would assume much quieter or no fans for cooling the VRM chips.

Definitely looks like this board should be cheaper than the Gigabyte/Asus boards and I am sure Intel will be bringing out an updated Cascade Lake w3175x version to compete with the summer introduction of threadripper zen 2.

Just waiting for Supermicro to come out with a w3175x workstation board and then will have serious decisions to make once threadripper zen 2 based motherboards come out which I am sure you will do comparison tests with...

Posted on 2019-03-20 13:12:48

Huh, that is interesting... but the pictures are making my head hurt. The dimensions of the PCB are greatly distorted in those images, throwing off the relative sizes of things like the CPU socket, RAM slots, and PCI-E slots. If they hadn't called out that it is supposed to be EATX size, I wouldn't have been able to tell :/

Posted on 2019-03-20 15:42:17
lemans24

This motherboard is definitely smaller than the Asus and Gigabyte w3175x motherboards by far...there are only 6 dimms on this board.

The size may be hard to realize as this is a naked motherboard with no chips/sockets.
It looks like a really good layout and I wished someone would do a full e-atx workstation motherboard for threadripper too as this demands a full size case if you want to cool this chip properly anyway. There are no official threadripper workstation level motherboards on the market!!

Will keep looking out for w3175x motherboards as Intel did not release the w3175x out of the goodness of their hearts!! No successful mega-rich company releases products on purpose for flimsy/no reason...that makes no sense to me.
It looks to me that Intel is trying to compete with threadripper with their current 14nm design, see if they can make some money and then bring out some real guns once they perfect their 10nm architecture.

The ball is definitely in AMD's court until Intel gets their act together with the major drawback for AMD that Intel is STILL making money hand over fist with their chips...

Posted on 2019-03-20 17:17:22
pamodeo

Very interesting article!

I used NAMD quite a lot in the past, while currently I mostly run other MD codes (mainly AMBER package) that run 99% on GPU. However, even if NAMD now considerably benefits from GPU, it still depends on CPU (more precisely, on CPU/GPU balance), thus it surely represents a good "real world" benchmark both in the CPU-only and in the mixed CPU-GPU mode.

However, recently I'm intensively using quantum chemistry codes running either CPU-only or, at best, 80%CPU/20%GPU. Furthermore, they scale far from ideally even on multiple cores of a single WS, and are both RAM and disk I/O intensive, so they require a very good balance between CPU clock, number of cores, memory and disk size, memory and disk speed.

In this view, the 512GB RAM limit of the Intel Xeon W-3175 would represent the possible bottleneck for a WS based on this processor and the 128GB (official) RAM max represents the limit that excluded AMD Threadripper from my panel of possible CPU candidates (some people from AMD stated in several blogs that LRDIMMs aren't PHYSICALLY supported by Threadripper-class CPUs: does anybody have information about the truth of this statement?). Thus I am also exploring solutions based on the AMD Epyc 7173 CPU. However, the availability of both Intel Xeon W-3175 and AMD Epyc 7173 processors is at present quite problematic.

As for the Intel Xeon W-3175 platform you used for this test, the Asus ROG Dominus Extreme MB officially only supports "12 x DIMM, Max. 192GB, DDR4 4200(O.C.)/4000(O.C.)/3800(O.C.)/3733(O.C.)/3600(O.C.)/3466(O.C.)/3400(O.C.)/3200(O.C.)/3000(O.C.)/2933(O.C.)/2800(O.C.)/2666/2400/2133 MHz ECC and non-ECC, Un-buffered Memory", while in your setup you reported "256GB DDR4-2666 Reg ECC Memory".

So, can I safely assume that this MB fully supports LRDIMMs, or was the MB you used an Asus "test platform" customized version?

Posted on 2019-03-27 14:42:15
lemans24

I think the description up above for w3175x ram capacity for the Asus ROG Dominus Extreme should be 192GB for both ECC and Non ECC Unbuffered ram.
Does not look like this motherboard will support buffered ECC ram which is what Registered/LRDIMMs support.

As to your next point, I too agree that there should be more tests that include multi-tasking that loads the processor with not just cache constrained tasks but also I/O constrained tasks like database queries, file reading/writing and video processing, all running randomly at the same time like how I run my applications 24 hours a day. I have concerns that while threadripper may be good for single processes that are well threaded with each task mainly running within the l2 cache, Intel w3175x may deliver better throughput for real time tasks that need to be run continuously throughout the day with other memory intensive tasks running too. I do real time options trading on a single PC and I think threadripper will be swamped if I start trading more than a few options at a time, but I don't have an Intel machine to do a fair comparison with. Most of the test results I have read online regarding the w3175x look very good the more different tasks you run, as compared to a single well threaded application process on threadripper...

Posted on 2019-03-28 12:04:15
pamodeo

Thank you for your kind reply!
The description of the HW setup used in this test includes "256GB DDR4-2666 Reg ECC Memory", so, apparently Registered Memory IS supported by this MB.

After reading other compared tests between Threadripper/Xeon/i9 CPUs, I fully agree with your analysis.
My big problem is that the kind of calculations I will mainly run on my new WS are both CPU and RAM and disk i/o intensive. My typical usage should be running 1 to 4 concurrent jobs using 4 to 16 cores each.

At present, a dual AMD Epyc 7173, also exhibiting a higher memory expansibility, is less expensive than a single Xeon W-3175 solution. Proper distribution of GPUs and M.2/U.2 SSD devices should also ensure an optimal usage of resources for up to 2 CPU and 1/2 mainly-GPU simultaneous jobs.

Obviously, a big AMD/Intel difference in performances could arise for codes available both with and without AVX-512 support. Unfortunately, the clock limitations associated to a heavy use of AVX-512 instructions make very difficult and application- (or even calculation-) dependent any evaluation of their real advantage.

All in all, and also considering the scarce-to-null availability of these CPUs and, consequently, of test platforms to run specific comparisons on the programs and calculations of interest, the choice of an architecture and the design of a balanced HW setup is definitively not a trivial job, especially if a reasonable performance-to-cost ratio is an additional constraint.

Posted on 2019-03-28 13:56:22
Liu Siyan

Hi Pamodeo,
Do you have more evidence that 256GB or even 512GB of DDR4-2666 Reg ECC RAM is compatible with the motherboard and W3175x? I checked the Qualified Vendors List of compatible memory and also asked ASUS technical support; they said the maximum RAM supported is 192GB (16GB*12). I'm considering building a workstation with 256GB or more RAM for scientific computing, and the W-3175x looks impressive.

Posted on 2019-04-18 16:53:54
pamodeo

Hi Liu Siyan,
no, indeed this page is the only place so far where a configuration with more than 192GB RAM (and using LRDIMM) has been described and tested. In this view, the only viable options to build workstations with >=256GB RAM presently seem to be based on either server Xeon or Epyc CPUs. This is rather annoying if budget and/or clock frequencies are among your main concerns, but I couldn't find any alternative to these two CPU families. What makes this situation even more annoying is the upcoming release of new families of both processors, although currently the race seems to be more oriented towards core numbers than clock frequencies. In this view, if avx512 is poorly supported by your sw of interest and/or your calculations scale poorly after about 8-16 cores, single or dual epyc 7371 could represent a reasonable compromise between cost, performances and availability on a reasonable timescale.

Posted on 2019-04-18 19:30:42
Donald Kinghorn

I messed up! It was 192GB non-ECC ... This board is made for overclocking! I think I may have made a cut-and-paste error!

The processor supports up to 512GB according to Intel's specs and that would mean Reg modules ... but not on this ASUS board

This was a testing sample board and I think this one did(??) support Reg DIMMs (it's a C621 chipset) ... I thought I had 32GB Reg modules in there, but checking back it was 12 x 16GB. I made the correction in the post. I hope I didn't mess anyone up with this!

Posted on 2019-04-19 18:31:18
Donald Kinghorn

see my reply below or my new post comment ... I messed you up on this! That system had 192GB Un-buffered mem in it! Really sorry about that!
We are getting in another (unannounced) board for testing but I'm not optimistic that it will be anything other than an overclocker rig too.

I think the best thing coming up soon will be the Cascade Lake Xeons (2nd gen Scalable). There should be some really nice single socket configs for this and I'm hoping for reasonable pricing. We don't have it in hand yet so I can't say for sure, but I'm expecting some really nice platforms.

Posted on 2019-04-19 18:45:32
Donald Kinghorn

Hi Liu, Sorry about the confusion on the memory. I did stop by Labs and tried to fire that board up again to verify the buffered vs un-buffered issue. (We were thinking that we did get it to start up with Reg mem.) We had torn the test-bed down and unfortunately that board is a big pain to start up! It takes 2 power supplies and is one of the most finicky motherboards I've seen. I could not get it to even POST with either Reg DIMMs or un-buffered! The guy that had gotten it working (after most of a day) was on vacation. I could only spend about an hour messing with it ... however, take that struggle as a warning about considering it. It really seems to be just a toy for overclockers.

I'm not sure exactly when we will get the testing samples for the Cascade Lake single socket Xeon but it looks like it should be really nice! I'm definitely looking forward to testing that as a serious scientific workstation platform.
Best regards --Don

Posted on 2019-04-22 16:47:43
Donald Kinghorn

I think you've gotten some good replies about the memory. Amber is, as you have seen, very optimized for GPU (I don't test it because of the licensing). Most of the MD codes are just getting non-bonded forces offloaded to GPU. This is where a lot of the balancing comes in. Over the years CPU has not really kept up with GPU performance. What I have in this post are probably the best results I've seen and are right up there with the Threadripper.

We will surely be seeing some more interesting high end CPU offerings over the next few months. By the end of the year things should get really interesting. Intel is very active right now working on new hardware (new designs). Their software teams look to be doing the right things too. I'm looking forward to seeing what they come up with. The focus will likely be on ML/AI optimized devices but this is a good thing in my opinion.

I hope to see a renaissance of scientific computing based on the work that is being done with the machine/deep learning frameworks. The difficulty will be the large mass of legacy code. Porting to new devices and frameworks will be difficult, but I believe new researchers will realize how powerful the frameworks are in general and write new code.

In any case :-) This summer should see some more great CPU hardware! I also expect that supply issues and some of the craziness we have been seeing will settle down.

Posted on 2019-03-28 16:03:34
pamodeo

Your reply is clearly a case of "mind reading"... no more than 10 minutes before your message, I decided on an interim solution, delaying the purchase of the WS/server intended for heavy calculations until the new AMD/Intel CPUs are available and tested. Right now, I will buy a PC with an optimal performance/cost ratio and either an i9 or Threadripper CPU + 128GB and GPUs. This machine will cover the full (Amber) MD and moderate size/complexity QM calculations of . In this way, I'll be able to test the performance of the different SSD solutions used as high speed scratch devices. If required by budget limitations, the most costly components, such as the GPUs and M.2/U.2 SSDs, will eventually be transferred to the new WS/server and this computer will be downgraded to a desktop PC for data analysis.

Amber GPU-optimized code (pmemd) is unfortunately not free, while the CPU-only code (sander) is optimized for features rather than performance. In this view, NAMD surely represents a more reliable and versatile benchmark.

As for QM code, I presently use ORCA, which is freely available for non-profit use, but is only distributed in binary form. As with most QM programs, CPU core scaling, memory and disk I/O usage depend heavily on the specific type of calculation, in addition to system size and setup. In general, except for ideally-scaling modules (numerical frequency calculations, using one core for each coordinate shift), the scaling is quite poor, so only very long runs or extreme hurry can justify the use of more than 8-16 cores.
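As a rough illustration of why poor scaling makes more than 8-16 cores hard to justify, here is a minimal sketch using Amdahl's law. It is not derived from ORCA or from any benchmark in this post; the parallel fractions (0.90 and 0.99) are purely hypothetical values chosen for illustration, and `amdahl_speedup` is a helper name made up for this example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` cores when only `parallel_fraction`
    of the single-core run time can be spread across cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions, NOT measured values for any real code.
    for p in (0.90, 0.99):
        for n in (8, 16, 28):
            s = amdahl_speedup(p, n)
            print(f"parallel fraction {p:.2f}, {n:2d} cores: "
                  f"speedup {s:5.2f}x, efficiency {s / n:4.0%}")
```

With a hypothetical 90% parallel fraction the speedup can never exceed 10x no matter how many cores are added, which is the kind of situation where higher clocks matter more than core count.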

However, there is a whole bunch of other freely available QM programs that still rely heavily on the CPU and are (or can be) also memory- and/or disk I/O-intensive and may possibly scale better with CPU cores. In my opinion, ORCA and/or one of the latter could usefully complement NAMD and the other benchmark used in your tests, especially to cover the "dark region" between powerful WS and servers, where CPU clocks are more important than core numbers (beyond 16 or so), and memory/disk sizes and performance all matter.

I want to compliment you again for these very interesting and useful tests on high-performance HW.

Posted on 2019-03-28 18:03:49
lemans24

Good to know that you will wait until the latter half of this year to buy your real "heavy calculations" server, as this is exactly the same position I will be taking. As much as GPUs are extremely capable of massive calculations, if you are doing any real-time processing based on external triggers, you need a processor that is great at both process multi-tasking and core multi-threading. Threadripper does have a great price/performance ratio for specialized programs/libraries, but I really am on the fence for custom programs that push the computer to the max with huge workloads that fully exploit the L2 cache, memory, file I/O and PCIe card I/O. For the kind of work that I do, my conclusion is that Threadripper has latency issues compared to the Intel architecture, and they affect multi-tasking more the longer you run your tasks. I have no exact data, but my current Threadripper (1950X) does not seem to be much faster than the Intel i9 - 6850 based X99 motherboard I was initially running with, except when I am compiling my applications. I will wait until we have some good reviews, from Don I hope, on both the Threadripper 3000/EPYC Rome and Intel W-3175X Cascade Lake chips later in the year.

In the meantime, I am completely blown away running CUDA on my Titan Xp vs the 1080 Ti, as the Titan Xp pulls further ahead of the 1080 Ti the more I push my calculations during real-time processing runs. The Turing-based cards look amazing and I hope to fully exploit them within the next few months, as I want to do some massive GPU calculations for real-time options modelling and I have only scratched the surface. The caveat is that I hope I can make some real money, options trading, to pay off all this hardware!! LOL

Posted on 2019-03-29 11:16:16
Hypersphere

I am confused about AVX512 and MD, which prompts some naive questions. If AVX512 causes the clock speed to decrease, would this be a disadvantage for MD performance? What would happen if AVX512 were disabled in these tests?

Posted on 2019-04-16 13:10:54
Donald Kinghorn

AVX512, and vectorization in general, can give a tremendous boost to performance. But the code has to take advantage of it, either with direct effort or by linking to something that uses it, like MKL. NAMD uses Charm++ as its "engine"; it does not link to MKL and the vectorization performance is not optimal. It doesn't benefit from AVX512. Yasara makes some interesting comments: they say they use vectorization with AVX but then complain about the clock. AVX512 in particular runs at a lower clock, and when active it lowers the core clock too in order to maintain power levels.
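To make the clock trade-off concrete, here is a toy model, a sketch only. It assumes that the vectorized part of a run goes twice as fast with AVX512 as with 256-bit AVX2, and that the whole run executes at the lower AVX512 clock. The clock values (4.0 GHz and 3.0 GHz) are hypothetical and are not the specs of any CPU in this post, and `run_time` and `avx512_gain` are names made up for this example.

```python
def run_time(vectorized_fraction: float, clock_ghz: float,
             vector_speedup: float) -> float:
    """Relative run time: the vectorized part is divided by `vector_speedup`,
    and everything is scaled by the clock the core actually runs at."""
    scalar_part = 1.0 - vectorized_fraction
    vector_part = vectorized_fraction / vector_speedup
    return (scalar_part + vector_part) / clock_ghz


def avx512_gain(vectorized_fraction: float, avx2_clock_ghz: float,
                avx512_clock_ghz: float) -> float:
    """Speedup of an AVX512 build over an AVX2 build in this toy model.
    Values below 1.0 mean the lower AVX512 clock costs more than the
    wider vectors give back."""
    baseline = run_time(vectorized_fraction, avx2_clock_ghz, 1.0)
    with_avx512 = run_time(vectorized_fraction, avx512_clock_ghz, 2.0)
    return baseline / with_avx512


if __name__ == "__main__":
    # Hypothetical clocks for illustration only.
    avx2_clock, avx512_clock = 4.0, 3.0
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        gain = avx512_gain(f, avx2_clock, avx512_clock)
        print(f"vectorized fraction {f:.2f}: AVX512 gain {gain:.2f}x")
```

In this model a code with little vectorized work runs slower with AVX512 enabled, while a heavily vectorized one (the MKL Linpack style of workload) still comes out ahead. Real behavior depends on the actual clocks and on how well the code vectorizes.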

For the most part MD codes are doing floating point matrix-vector operations. That is what AVX is made for. But, for example, NAMD is designed to be optimal for massively parallel architectures. It's a standard package on supercomputers. There are always trade-offs in software design.

The 9920X is going to be great with or without AVX being utilized. You can go to a bit higher clock 8-core with the 9900K Coffee Lake (based on the Skylake core) but I don't recommend it for a Scientific Workstation.

Software is complicated and modern hardware is very complicated. The newer Intel CPUs have like 5 different clocks. It used to be easy to predict performance but these days it's hard to know for sure how a particular bit of software will respond. Your best bet is to go for a solid foundation platform like you have spec'd. You can take more risk and go with something like a high core count AMD if you know that your program scales near-linearly with cores.

Disabling AVX in testing is a really good experiment. I did this once when I was compiling NAMD from source with Intel compilers and MKL. There was not much improvement from using the Intel compilers ... I ended up recommending using the binary distribution of NAMD rather than building from source.

Posted on 2019-04-17 17:50:22
Hypersphere

Thanks again. It is likely that I will be mostly using YASARA, which benefits more from clock speed than number of cores. However, I might be increasing my use of NAMD, which appears to scale well with number of cores. It looks as though the i9-9920X will be a good middle ground between clock speed and number of cores. Moreover, much of the acceleration with both MD codes will come from the GPUs, and two RTX2080Ti 11GB cards should be ample for my configuration.

Posted on 2019-04-17 17:56:55
Donald Kinghorn

I messed up! I had listed the memory for the W-3175X test system as 256GB Reg ECC ... it was 192GB unbuffered ... This board is made for overclocking!

The processor supports up to 512GB according to Intel's specs and that would mean Reg modules ... but not on this ASUS board (or the Gigabyte board)

This was a testing sample board and I think this one did(??) have support for Reg DIMMs enabled (it's a C621 chipset) ... I thought I had 32GB Reg modules in there, but checking back it was fully loaded with 12 x 16GB. I made the correction in the post. I hope I didn't mess anyone up with this!

Posted on 2019-04-19 18:36:38