Intel Xeon W-3175X and i9 9990XE Linpack and NAMD on Ubuntu 18.04

Introduction

I was able to get a little time in on the Intel Xeon W-3175X and the Core i9 9990XE processors. I ran a couple of numerical compute performance tests with the Intel MKL Linpack benchmark and NAMD. I used the same system image that I had used recently to look at 3 Intel 8-core processors, so I will include those results here as well. There will be results for the W-3175X, 9990XE, 9800X, W-2145, and 9900K.

Intel has been doing some strange things recently. They have a shortage of many of their processors, so we are seeing some models that would not normally appear in their lineup. For example, some processors with integrated GPUs that come out of the fab with faulty GPU sections, but perfectly good CPU sections, are being released with the GPU disabled.

There are 2 recent Intel processors that are really strange: the Xeon W-3175X 28-core and the Core i9 9990XE overclocked 14-core. I don’t know for sure if these processors are the result of fab “problems” or not. The W-3175X could be an overclocked Xeon Scalable 8180 that had problems rendering it useless in a multi-socket system, and the 9990XE could be a 9980XE 18-core that had a few bad cores that, when disabled, allowed for overclocking of the remaining 14 cores. That’s pure speculation based only on my wild imagination!

Neither of these processors is actually available other than through an odd auction process, in batches, to OEMs. The 9990XE does not have a warranty and I cannot find it on “Intel Ark”. The W-3175X is listed on Intel Ark. To me, the i9 9990XE does not appear to be a real product and I don’t understand why Intel would sell it to anyone without a warranty. They are not supporting it in any tangible way. The W-3175X at least “looks” like it might be a real product, but with no promise of availability or predictable pricing.

The W-3175X requires a special motherboard and cooler and surprisingly both ASUS and Gigabyte have made boards available. They are extremely large boards and they are using the Xeon Scalable C621 chipset.

We did get these two oddballs in for testing at Puget Systems. My colleagues Matt and William did a fair amount of testing with popular software programs running on Windows 10. You can find several of their posts listed in the Puget Systems “Articles” section. This one is particularly good for the 9990XE: Intels Core Xperiment i9 9990XE. I very much agree with the sentiment in that post that the 9990XE is just an experiment; it’s not a viable product. In practical terms, the W-3175 is not a viable product either.

Caveats aside, I was of course curious about the raw double precision floating point performance of these monsters, so I popped into Puget labs, fired up Ubuntu 18.04 on them, and ran the Intel optimized Linpack benchmark from MKL. That is, in my opinion, the best relative measure of numerical compute performance for Intel processors.


Processor Specs i9 9900K, i7 9800X, Xeon W-2145, i9 9990XE, Xeon W-3175

The following table lists some of the specification differences between these processors that are relevant when configuring a numerical computing workstation.

i9 9900K, i7 9800X, Xeon W-2145, i9 9990XE, Xeon W-3175 Features

Features        | i9 9900K    | i7 9800X  | Xeon W-2145      | i9 9990XE | Xeon W-3175
Code Name       | Coffee Lake | Skylake-X | Skylake-W        | Skylake-X | Skylake-W
Cores           | 8           | 8         | 8                | 14        | 28
Base Clock      | 3.6GHz      | 3.8GHz    | 3.7GHz           | 4.0GHz    | 3.1GHz
Max Turbo       | 5.0GHz      | 4.5GHz    | 4.5GHz           | 5.1GHz*   | 3.8GHz*
All Core        | 4.7GHz      | 4.1GHz    | 4.3GHz           | 5.0GHz*   | 3.7GHz*
Cache           | 16 MB       | 16.5 MB   | 11 MB            | 19.25 MB  | 38.5 MB
TDP             | 95 W        | 165 W     | 140 W            | 255 W     | 255 W
Max Mem         | 64 GB       | 128 GB    | 512 GB (Reg ECC) | 128 GB    | 512 GB (Reg ECC)
Mem Channels    | 2           | 4         | 4                | 4         | 6
Max PCIe lanes  | 16          | 44        | 48               | 44        | 48
X16 GPU support | 1           | 2         | 3 (4 w/PLX)      | 2         | 3 (4 w/PLX)*
Vector Unit     | AVX2        | AVX512    | AVX512           | AVX512    | AVX512
Price           | $500        | $600      | $1113            | *         | $3000*

Notes:

Clock Frequencies: I have included some raw frequency monitoring output in the appendix. This is what I observed when running Linpack. For the 9990XE the job started at 5.0GHz and stayed there on all cores during the initialization of the job. When the AVX512 units went under load, the clock on all but 2 cores dropped to 3.1GHz; those 2 cores remained near 5.0GHz. 3.1GHz is presumably the AVX512 clock frequency. For the job run with the W-3175 the initial clock was 4.3GHz, dropping to 3.7GHz all-core during initialization and then to 2.8GHz when the AVX512 load started.

PCIe: It is common for Xeon-W systems to support 2 or 3 X16 cards without a PLX switch. The motherboard we used had an X16,X8,X16,X8 layout.

Pricing: There is no official price for the 9990XE (it’s not a product); see Intels Core Xperiment i9 9990XE. The W-3175 is listed as a real product on Intel Ark with an MSRP of approx. $3000. It also requires a special (massive!) socket 3647 motherboard, which would cost close to $2000, and a really good cooler. There are many details that make a system utilizing the W-3175 processor a non-viable product. I understand the temptation to think that you “want one of those”, but really, it looks like it is not supportable as a product.


Hardware under test:

There were 4 platforms used in this testing.

  • Intel Core i9 9900K 3.6GHz 8-Core

    • Gigabyte Z390 Designare Motherboard (1 x X16 PCIe)
    • 64 GB DDR4-2666 Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Core i9 9990XE 5.0GHz 14-Core and Core i7 9800X 3.8GHz 8-Core

    • Gigabyte X299 Designare Motherboard (2 x X16 PCIe)
    • 128GB DDR4-2666 Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Xeon W-2145 3.7GHz 8-Core

    • Asus WS C422 SAGE/10G Motherboard (4 x X16 PCIe)
    • 256GB DDR4-2666 Reg ECC Memory
    • 1 TB Intel 660p M.2 SSD
    • NVIDIA RTX 2080Ti
  • Intel Xeon W-3175 3.1GHz 28-Core

Software:

I had the OS and applications installed on the Intel 660p M.2 drive and swapped it between the test systems.

I am running Linux for this testing but there is no reason to expect that the same types of workloads on Windows 10 would show any significant difference in performance.


Results

Linpack

An optimized Linpack benchmark can achieve near theoretical peak performance for double precision floating point on a CPU. It is the first benchmark I run on any new CPU. It is (still) the benchmark used to rank the Top500 supercomputers in the world. I feel it is the best performance indicator for numerical computation with maximally optimized software. The Intel optimized Linpack makes great use of the excellent MKL library. There are many programs that link to MKL for performance. This includes the very useful “numerical compute scripting” packages Anaconda Python and Mathworks MATLAB.
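
To put “theoretical peak” in concrete terms, a rough estimate is cores x clock x floating point operations per cycle per core. The sketch below is only an estimate under stated assumptions: it assumes 32 double precision operations per cycle per core for the AVX512 parts (two 512-bit FMA units), and it uses the AVX512 clocks observed under load (see the clock frequency note above) together with the measured results from the appendix. The function and variable names are just for illustration.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Rough theoretical peak in GFLOPS: cores * GHz * FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


# Assumption: 2 FMA units * 8 doubles per 512-bit vector * 2 ops (multiply + add)
AVX512_FLOPS_PER_CYCLE = 32

# name: (cores, observed AVX512 clock in GHz, measured Linpack GFLOPS from the appendix)
runs = {
    "i9 9990XE": (14, 3.1, 974.7487),
    "Xeon W-3175X": (28, 2.8, 1759.4013),
}

for name, (cores, ghz, measured) in runs.items():
    peak = peak_gflops(cores, ghz, AVX512_FLOPS_PER_CYCLE)
    print(f"{name}: estimated peak {peak:.0f} GFLOPS, "
          f"measured {measured:.0f} GFLOPS, ratio {measured / peak:.0%}")
```

With these assumptions both processors come out at roughly 70% of the estimated peak. The estimate is crude, since the clocks vary per core and over the course of the run.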

linpack chart

This is not necessarily a good selection of comparative results, but hopefully it does give you an idea of the relative performance. These results all used the same test system image and software versions.

The double precision floating point performance of the W-3175 is very impressive, as expected.

Note: These jobs ran with “real” threads since “Hyperthreads” are not useful for this calculation.

Note: The 8-core results are with a large problem size of 75000 simultaneous equations (a 75000 x 75000 “triangular solve”) and used approximately 44GB of system memory. The 9990XE and W-3175 were tested with a problem size of 110016 using approximately 94GB of system memory. Also, note that the 9900K is at a disadvantage on this benchmark since it has the older AVX2 vector unit.
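
The memory needed for a given problem size is dominated by the n x n matrix of 8-byte double precision values. A minimal sketch for estimating it follows. The function name is hypothetical, and it counts only the matrix itself, so it will not exactly match the approximate totals quoted above.

```python
def linpack_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to store an n x n matrix of double precision values."""
    return n * n * bytes_per_element


for n in (75000, 110016):
    size = linpack_matrix_bytes(n)
    # Print both decimal gigabytes and binary gibibytes to avoid unit confusion
    print(f"n = {n}: {size / 1e9:.1f} GB ({size / 2**30:.1f} GiB)")
```

For n = 110016 this gives 96,828,162,048 bytes, which is close to the 96,830,363,392 bytes reported as “Maximum memory requested” in the raw output in the appendix.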

NAMD

I also tested with the Molecular Dynamics package NAMD. NAMD scales really well across multiple cores and it is not specifically optimized for Intel hardware. It is highly optimized code and it uses the very interesting Charm++ for its parallel capabilities. NAMD is an important program and I like it for testing since it is a good example of well optimized code that scales to massive numbers of processes and also has very good GPU acceleration that needs to be balanced by good CPU performance.

NAMD CPU

The AVX512 vector units are not that important for this code since it is designed to run well on a wide variety of hardware. Higher core counts are a big advantage for performance since NAMD has very good parallel scaling.

Note: These jobs ran with “Hyperthreads” since they help with the way NAMD uses threads. It is always worth experimenting with Hyperthreads to see if they help or not.

Note: The performance units here are “days per nano-second” of simulation time. Adding a GPU will dramatically increase the performance as will be seen in the next chart.
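
Since “days per nano-second” is a time-per-work unit, lower is better, and a speedup is the ratio of the old value to the new one. The small sketch below shows the conversion to the also common “nano-seconds per day”. The function names are made up and the numbers are placeholders for illustration, not results from these tests.

```python
def ns_per_day(days_per_ns):
    """Convert days/ns (lower is better) to ns/day (higher is better)."""
    return 1.0 / days_per_ns


def speedup(days_per_ns_before, days_per_ns_after):
    """Speedup factor when going from one days/ns figure to another."""
    return days_per_ns_before / days_per_ns_after


# Placeholder values only, chosen to make the arithmetic easy to follow
cpu_only = 1.0   # hypothetical days/ns without a GPU
with_gpu = 0.1   # hypothetical days/ns with a GPU

print(f"CPU only: {ns_per_day(cpu_only):.1f} ns/day")
print(f"With GPU: {ns_per_day(with_gpu):.1f} ns/day")
print(f"Speedup:  {speedup(cpu_only, with_gpu):.1f}x")
```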

NAMD GPU

The first thing to notice is that the performance has increased by over a factor of 10 by including the NVIDIA RTX 2080Ti!

Conclusions and Recommendations

I have to emphasize that the 9990XE and W-3175 processors are not really viable components for supportable products. They are more enthusiast curiosities than workstation components. This is especially true for the 9990XE: it has no support of any kind from Intel, and I don’t even know what they were thinking. The W-3175 is more interesting, but it is still not viable as a product because of the lack of commitment and supply as well as the “extreme” nature of the overall system platform needed to run it. So, don’t even think about it!

On the positive side, 2019 should be an interesting year for new hardware. We expect a new architecture design from Intel toward the end of the year (after a hardware security bug-fix refresh). The future platform should be a significant change from what we are using now, including new chipsets supporting PCIe v4 among other niceties. We also expect the current supply issues to be resolved. Intel also has other interesting hardware projects in the works, and we may see some results from them in the form of new compute accelerator hardware. And that’s just Intel … AMD and ARM are looking really interesting too!

Happy computing –dbk

Appendix

9990XE raw data snippets

kinghorn@utest:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              28
On-line CPU(s) list: 0-27
Thread(s) per core:  2
Core(s) per socket:  14
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Core(TM) i9-9990XE CPU @ 4.00GHz
Stepping:            4
CPU MHz:             1200.741
CPU max MHz:         5100.0000
CPU min MHz:         1200.0000
BogoMIPS:            8000.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            19712K
NUMA node0 CPU(s):   0-27
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
                     dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon
                     pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
                     vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
                     tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
                     cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase
                     tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx
                     smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc
                     cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp
                     hwp_pkg_req flush_l1d
kinghorn@utest:~/projects/benchmarks/linpack$ ./runme_xeon64

Current date/time: Fri Feb  8 11:35:49 2019

CPU frequency:    4.999 GHz
Number of CPUs: 1
Number of cores: 14
Number of threads: 14

Parameters are set to:

Number of tests: 1
Number of equations to solve (problem size) : 110016
Leading dimension of array                  : 110016
Number of trials to run                     : 1    
Data alignment value (in Kbytes)            : 1    

Maximum memory requested that can be used=96830363392, at the size=110016

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
110016 110016 1      910.742    974.7487 9.762934e-09 2.885014e-02   pass

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
110016 110016 1       974.7487 974.7487

Residual checks PASSED

End of tests
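
The GFlops figure can be reproduced from the problem size and run time using the conventional Linpack operation count of 2/3 n^3 + 2 n^2. This is a sketch that assumes the benchmark uses that count; the function name is made up for illustration.

```python
def linpack_gflops(n, seconds):
    """GFLOPS for solving n equations in `seconds`,
    assuming the conventional operation count 2/3*n^3 + 2*n^2."""
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return operations / seconds / 1e9


# Problem size and time taken from the 9990XE output above
print(f"{linpack_gflops(110016, 910.742):.2f} GFlops")
```

Using the size and time reported above (110016 and 910.742 seconds) gives about 974.75 GFlops, in agreement with the reported 974.7487.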

Frequencies at the start of the job run (showing the active “hyperthreads”):

kinghorn@utest:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
  24| 39.76| 60.24|  5009
  15|  0.64| 99.36|  5009
  14|  0.40| 99.60|  5009
  21|  0.13| 99.87|  5008
  20|  0.12| 99.88|  5010
  19|  0.12| 99.88|  5009
  16|  0.12| 99.88|  5006
  18|  0.10| 99.90|  5016
  25|  0.10| 99.90|  5008
  17|  0.07| 99.93|  5005
  26|  0.06| 99.94|  5009
  27|  0.06| 99.94|  4997
  23|  0.05| 99.95|  5017
  22|  0.05| 99.95|  5005
    |Mperf|               

Frequencies during AVX512 load:

kinghorn@utest:~$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  14|  1.17| 98.83|  3116
  24|  0.77| 99.23|  3104
  23|  0.21| 99.79|  3479
  19|  0.12| 99.88|  3106
  25|  0.10| 99.90|  3221
  17|  0.10| 99.90|  3097
  21|  0.09| 99.91|  3144
  26|  0.08| 99.92|  5007
  18|  0.08| 99.92|  3101
  20|  0.07| 99.93|  3106
  22|  0.06| 99.94|  3211
  15|  0.04| 99.96|  3100
  27|  0.03| 99.97|  4871
  16|  0.00|100.00|  3072
    |Mperf               
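
A listing like the one above is easy to summarize with a few lines of Python. The following sketch parses the rows (copied from the AVX512 load output above), reports the median frequency, and lists the CPUs still running near the turbo clock. The 4500 MHz cutoff and the function name are arbitrary choices for illustration.

```python
import statistics

# Rows copied from the "cpupower monitor -m Mperf" output during AVX512 load
SAMPLE = """\
CPU | C0   | Cx   | Freq
  14|  1.17| 98.83|  3116
  24|  0.77| 99.23|  3104
  23|  0.21| 99.79|  3479
  19|  0.12| 99.88|  3106
  25|  0.10| 99.90|  3221
  17|  0.10| 99.90|  3097
  21|  0.09| 99.91|  3144
  26|  0.08| 99.92|  5007
  18|  0.08| 99.92|  3101
  20|  0.07| 99.93|  3106
  22|  0.06| 99.94|  3211
  15|  0.04| 99.96|  3100
  27|  0.03| 99.97|  4871
  16|  0.00|100.00|  3072
    |Mperf
"""

NEAR_TURBO_MHZ = 4500  # arbitrary cutoff, an assumption for this example


def parse_mperf(text):
    """Return (cpu, c0_percent, cx_percent, freq_mhz) tuples from the listing."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # skip the header and the trailing "Mperf" line
        cpu, c0, cx, freq = fields
        rows.append((int(cpu), float(c0), float(cx), int(freq)))
    return rows


rows = parse_mperf(SAMPLE)
freqs = [freq for _, _, _, freq in rows]
fast = sorted(cpu for cpu, _, _, freq in rows if freq > NEAR_TURBO_MHZ)

print(f"CPUs parsed: {len(rows)}")
print(f"Median frequency: {statistics.median(freqs):.0f} MHz")
print(f"CPUs above {NEAR_TURBO_MHZ} MHz: {fast}")
```

For this sample the median is 3111 MHz and the two fast CPUs are 26 and 27, consistent with the observation that all but 2 cores dropped to about 3.1GHz.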

W-3175 data snippets

from /proc/cpuinfo

processor	: 55
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) W-3175X CPU @ 3.10GHz
stepping	: 4
microcode	: 0x2000059
cpu MHz		: 3800.392
cache size	: 39424 KB
physical id	: 0
siblings	: 56
core id		: 30
cpu cores	: 28
apicid		: 61
initial apicid	: 61
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
		  fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
		  rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
		  tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
		  xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti
		  ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
		  smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
		  intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc
		  cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_pkg_req pku ospke
		  flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6200.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:
kinghorn@utest:~/projects/benchmarks/linpack$ ./runme_xeon64

Current date/time: Fri Feb 15 05:17:29 2019

CPU frequency:    4.289 GHz
Number of CPUs: 1
Number of cores: 28
Number of threads: 28

Parameters are set to:

Number of tests: 10
Number of equations to solve (problem size) : 10000 15000 18000 20000 22000 25000 26000 27000 30000 110016
Leading dimension of array                  : 10000 15000 18008 20016 22008 25000 26000 27000 30000 110016
Number of trials to run                     : 2     2     2     2     1     1     1     1     1     1    
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     1     1    

Maximum memory requested that can be used=96830363392, at the size=110016

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
10000  10000  4      0.456      1463.3087 1.051521e-10 3.707768e-02   pass
10000  10000  4      0.450      1482.7231 1.051521e-10 3.707768e-02   pass
15000  15000  4      1.399      1608.9285 2.253401e-10 3.549145e-02   pass
15000  15000  4      1.395      1613.5717 2.253401e-10 3.549145e-02   pass
18000  18008  4      2.431      1599.5280 2.774894e-10 3.038850e-02   pass
18000  18008  4      2.430      1600.4747 2.774894e-10 3.038850e-02   pass
20000  20016  4      3.459      1542.0288 3.665729e-10 3.244973e-02   pass
20000  20016  4      3.459      1541.9453 3.665729e-10 3.244973e-02   pass
22000  22008  4      4.408      1610.6650 4.682967e-10 3.430089e-02   pass
25000  25000  4      6.509      1600.4551 5.435008e-10 3.090695e-02   pass
26000  26000  4      7.131      1643.3722 5.904530e-10 3.104779e-02   pass
27000  27000  4      7.888      1663.8254 6.503383e-10 3.171380e-02   pass
30000  30000  1      10.731     1677.5287 8.712018e-10 3.434286e-02   pass
110016 110016 1      504.572    1759.4013 1.061083e-08 3.135573e-02   pass

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
10000  10000  4       1473.0159 1482.7231
15000  15000  4       1611.2501 1613.5717
18000  18008  4       1600.0013 1600.4747
20000  20016  4       1541.9871 1542.0288
22000  22008  4       1610.6650 1610.6650
25000  25000  4       1600.4551 1600.4551
26000  26000  4       1643.3722 1643.3722
27000  27000  4       1663.8254 1663.8254
30000  30000  1       1677.5287 1677.5287
110016 110016 1       1759.4013 1759.4013

Residual checks PASSED

End of tests

Frequencies at the start of the job run (showing the active “hyperthreads”):

kinghorn@utest:~/projects$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  39|  0.99| 99.01|  3790
  28|  0.47| 99.53|  3789
  34|  0.14| 99.86|  3789
  41|  0.07| 99.93|  3787
  31|  0.06| 99.94|  3789
  29|  0.06| 99.94|  3786
  36|  0.05| 99.95|  3787
  52|  0.03| 99.97|  3786
  30|  0.02| 99.98|  3785
  35|  0.02| 99.98|  3780
  32|  0.02| 99.98|  3777
  51|  0.01| 99.99|  3846
  37|  0.01| 99.99|  3784
  42|  0.01| 99.99|  3781
  49|  0.01| 99.99|  3775
  33|  0.01| 99.99|  3774
  55|  0.01| 99.99|  3767
  54|  0.01| 99.99|  3703
  48|  0.00|100.00|  3756
  40|  0.00|100.00|  3701
  53|  0.00|100.00|  3640
  46|  0.00|100.00|  3607
  44|  0.00|100.00|  3577
  45|  0.00|100.00|  3524
  47|  0.00|100.00|  3515
  43|  0.00|100.00|  3510
  50|  0.00|100.00|  3500
  38|  0.00|100.00|  3448
    |Mperf               

Frequencies during AVX512 load:

kinghorn@utest:~/projects$ sudo cpupower monitor -m Mperf | sort -k2 -r
CPU | C0   | Cx   | Freq
  28|  2.98| 97.02|  2792
  39|  0.80| 99.20|  2792
  34|  0.31| 99.69|  2792
  44|  0.16| 99.84|  2792
  41|  0.16| 99.84|  2791
  36|  0.11| 99.89|  2793
  29|  0.10| 99.90|  2788
  52|  0.09| 99.91|  2794
  42|  0.09| 99.91|  2785
  30|  0.07| 99.93|  2787
  45|  0.06| 99.94|  2787
  46|  0.06| 99.94|  2783
  37|  0.04| 99.96|  3264
  35|  0.04| 99.96|  2799
  50|  0.03| 99.97|  2796
  32|  0.03| 99.97|  2766
  53|  0.02| 99.98|  2824
  48|  0.02| 99.98|  2806
  51|  0.02| 99.98|  2787
  31|  0.02| 99.98|  2784
  54|  0.02| 99.98|  2777
  43|  0.02| 99.98|  2775
  55|  0.01| 99.99|  2808
  40|  0.01| 99.99|  2754
  38|  0.01| 99.99|  2751
  33|  0.01| 99.99|  2733
  49|  0.00|100.00|  2784
  47|  0.00|100.00|  2765
    |Mperf