Update: I have confirmed that the RTX 6000 Ada performance issues were due to a bad motherboard on the test bed that was used. It is still odd that only that motherboard plus the 6000 Ada had any problems. We have confirmed that the Linux workloads I ran, and the CPU rendering workloads on Windows, are what we expect from this great GPU!
I was prompted to do some testing by a commenter on one of my recent posts, NVIDIA RTX4090 ML-AI and Scientific Computing Performance (Preliminary). They had concerns about problems with dual NVIDIA RTX4090s on AMD Threadripper Pro platforms.
They pointed out the following links:
- Parallel training with 4 cards 4090 cannot be performed on AMD 5975WX, stuck at the beginning
- Standard nVidia CUDA tests fail with dual RTX 4090 Linux box
- DDP training on RTX 4090 (ADA, cu118)
I ran some applications to reproduce the problems reported above and tried to dig deeper into the issues with more extensive testing. The included table below tells all!
Before we get to the table of results and comments, here are the hardware and software that were used for testing.
Test Hardware and Configurations
This is the most relevant component information.
- ASUS Pro WS WRX80E-SAGE SE WIFI
- BIOS 1003
- Ryzen Threadripper PRO 5995WX 64-Cores
- With GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt"
- Supermicro X12SPA-TF
- BIOS 1.4a
- Xeon(R) W-3365 32-Cores
- NVIDIA GeForce RTX 4090
- NVIDIA RTX 6000 Ada Generation
- NVIDIA GeForce RTX 3090
- Ubuntu 22.04.1 LTS
- NVIDIA Driver Version: 525.85.05
- CUDA Version: 12.0
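As a quick sanity check on the kernel boot parameters above, the running kernel's command line can be inspected from `/proc/cmdline`. This is a minimal sketch (the helper name is my own, and it assumes a Linux system):

```python
import os

# Check whether the IOMMU boot flags from GRUB_CMDLINE_LINUX are actually
# active on the running kernel. Helper name and structure are illustrative.

def iommu_flags_present(cmdline: str) -> dict:
    """Return which of the expected IOMMU flags appear in a kernel cmdline."""
    tokens = cmdline.split()
    expected = ("amd_iommu=on", "iommu=pt")
    return {flag: flag in tokens for flag in expected}

if __name__ == "__main__" and os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as f:
        status = iommu_flags_present(f.read())
    for flag, present in status.items():
        print(f"{flag}: {'set' if present else 'NOT set'}")
```

If the flags are not reported as set, edit `/etc/default/grub` and run `update-grub`, then reboot.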
From NVIDIA CUDA 12 Samples
- simpleP2P
- p2pBandwidthLatencyTest
From NVIDIA NGC Containers
- TensorFlow 1.15 ResNet50
- HPL (FP64 Linpack; it performs many times faster on NVIDIA compute GPUs, but I still like to run this benchmark on GeForce and Pro GPUs)
- PyTorch DDP
- NAMD 2.14 ApoA1
- PugetBench-minGPT (Based on Andrej Karpathy’s minGPT uses PyTorch DDP)
Problems With RTX 4090 Multi-GPU: AMD vs Intel vs RTX 6000 Ada or RTX 3090
| Test Jobs | 2 x RTX4090 (TrPro) | 2 x RTX4090 (Xeon W) | 2 x RTX3090 (TrPro) | 2 x RTX6000 Ada (TrPro) | 2 x RTX6000 Ada (Xeon W) |
|---|---|---|---|---|---|
| simpleP2P | Fail | Fail | NO P2P | YES P2P | YES P2P |
| p2pBandwidthLatencyTest | 54 GB/s | 54 GB/s | 16.5 GB/s | 51.1 GB/s | 41.4 GB/s |
| TensorFlow 1.15 ResNet50 | Hang | 2131 img/s | 2048 img/s | 737 img/s | 3832 img/s |
| NAMD ApoA1 | 0.01457 day/ns | N/A | 0.01537 day/ns | 0.02322 day/ns | 0.01442 day/ns |
| HPL NGC | 2246 GFLOPS | 2225 GFLOPS | 1067 GFLOPS | 567 GFLOPS | 2567 GFLOPS |
| minGPT | Fail | Fail | 123 sec | 332 sec | 101 sec |
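A note on units: NAMD reports performance as day/ns (days of wall time per nanosecond simulated), so lower is better. To compare against the more familiar ns/day figure, take the reciprocal. A quick sketch of that conversion using the day/ns results from the table:

```python
# Convert NAMD's day/ns metric to the more familiar ns/day (its reciprocal).

def day_per_ns_to_ns_per_day(day_per_ns: float) -> float:
    return 1.0 / day_per_ns

# NAMD ApoA1 results from the table above (day/ns; lower is better)
results = {
    "2 x RTX4090 (TrPro)": 0.01457,
    "2 x RTX3090 (TrPro)": 0.01537,
    "2 x RTX6000 Ada (TrPro)": 0.02322,
    "2 x RTX6000 Ada (Xeon W)": 0.01442,
}

for config, d in results.items():
    print(f"{config}: {day_per_ns_to_ns_per_day(d):.1f} ns/day")
```

So, for example, the 0.02322 day/ns result for the RTX 6000 Ada pair on TrPro is only about 43 ns/day, versus roughly 69 ns/day for the same GPUs on the Xeon W.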
There are two major problems:

- P2P functionality appears to be "partially" broken with 2 x RTX 4090. Some jobs that use it either fail or produce corrupt results.
  - However, P2P is reported as available and shows good GPU-to-GPU bandwidth.
  - PyTorch distributed data-parallel (DDP) produces corrupt results or hangs. It finishes/returns on the Xeon W, but there is no verification of success.
  - minGPT (which also uses DDP) produces corrupt results and fails.
- Everything works as expected with 2 x RTX 3090 on the AMD TrPro system.
- Everything works with 2 x RTX 6000 Ada on both the AMD TrPro and Intel Xeon W systems, but performance is very poor on the AMD TrPro system.
- All of the issues with 2 x RTX 4090 are present on the AMD TrPro system and also on the Xeon W, except for the TensorFlow ResNet job run.
  - On TrPro, TensorFlow 1.15 ResNet50 with 2 x 4090 (using NVIDIA NCCL) hangs. It runs fine on the Xeon W.
- Performance is very poor with 2 x RTX 6000 Ada on TrPro (the GPU clock stays stuck at 629 MHz and power usage is low).
**Note:** The tests in the table were mainly for functionality rather than performance. However, I did use higher-performance input parameters for the 2 x 6000 Ada testing on the Xeon W. Optimal performance input parameters were NOT used for the dual RTX 4090 and RTX 3090 job runs. Do not use this post as a performance comparison!
NVIDIA and AMD were made aware of these test results one week before this post was published, but had not yet replied as of publication time.
I do not have workarounds or fixes for these problems! The performance issues on the AMD platform are particularly troubling and we will do more troubleshooting to see if a solution to the issues can be found.
I hope that publishing these results will make the issues with RTX 4090 multi-GPU on AMD WRX80 motherboards more visible to the public, and that NVIDIA and AMD will be prompted to address the reported problems.
My testing was on Linux; however, we have also seen consistent issues in some of our Windows testing, in particular differences in behavior between 2 x RTX 4090 and 2 x RTX 6000 Ada.
If fixes or workarounds are found they will be posted back here as notes at the top of the page.
The Appendix provides more detail on a few of the job run failures.
Appendix: Select job output excerpts and comments
On both the AMD and Intel test platforms, simpleP2P fails with a verification error, but p2pBandwidthLatencyTest shows increased bandwidth with P2P enabled. For example, on the AMD platform:
```
(cuda12-20.04)[email protected]:~/cuda-samples-12.0/bin/x86_64/linux/release$ ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 25.07GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!
```
```
...
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 916.96  30.98
     1  30.75 922.65
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1
     0 918.04  54.12
     1  54.12 923.09
...
```
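To quantify the gap, pull the GPU0-GPU1 off-diagonal entries out of the two matrices and compare: roughly 54 GB/s with P2P enabled versus about 31 GB/s disabled. A small sketch of that arithmetic, with the values transcribed from the matrices above:

```python
# Compare GPU0<->GPU1 bidirectional bandwidth with P2P disabled vs enabled,
# using the off-diagonal entries from the p2pBandwidthLatencyTest output.

disabled_gb_s = (30.98 + 30.75) / 2  # average of the two off-diagonal entries
enabled_gb_s = 54.12

speedup = enabled_gb_s / disabled_gb_s
print(f"P2P enabled gives {speedup:.2f}x the disabled GPU-GPU bandwidth")
```

So P2P is clearly being exercised and delivering a real bandwidth benefit, even though simpleP2P's data verification fails.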
minGPT fails with apparent data corruption on both AMD and Intel. This is using PyTorch DDP for multiGPU.
```
...
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
Console output on TrPro during an HPL job run. This output clip shows the "stuck" GPU clock frequency, with only a small fraction of the GPU power being used, on the AMD TrPro system:
```
!!! WARNING: Rank: 1 : trp64 : GPU 0000:61:00.0 Clock: 626 MHz Temp: 49 C Power: 70 W PCIe gen 4 x16
!!! WARNING: Rank: 0 : trp64 : GPU 0000:41:00.0 Clock: 626 MHz Temp: 50 C Power: 65 W PCIe gen 4 x16
!!! WARNING: Rank: 1 : trp64 : GPU 0000:61:00.0 Clock: 626 MHz Temp: 50 C Power: 70 W PCIe gen 4 x16
!!! WARNING: Rank: 0 : trp64 : GPU 0000:41:00.0 Clock: 626 MHz Temp: 50 C Power: 65 W PCIe gen 4 x16
Prog= 2.38% N_left= 71424 Time= 10.63 Time_left= 435.92 iGF= 557.23 GF= 557.23 iGF_per= 278.62 GF_per= 278.62
!!! WARNING: Rank: 0 : trp64 : GPU 0000:41:00.0 Clock: 626 MHz Temp: 51 C Power: 66 W PCIe gen 4 x16
Prog= 3.56% N_left= 71136 Time= 15.81 Time_left= 428.61 iGF= 565.41 GF= 559.91 iGF_per= 282.70 GF_per= 279.96
!!! WARNING: Rank: 1 : trp64 : GPU 0000:61:00.0 Clock: 626 MHz Temp: 50 C Power: 70 W PCIe gen 4 x16
!!! WARNING: Rank: 0 : trp64 : GPU 0000:41:00.0 Clock: 626 MHz Temp: 51 C Power: 65 W PCIe gen 4 x16
Prog= 4.72% N_left= 70848 Time= 20.85 Time_left= 420.54 iGF= 575.77 GF= 563.75 iGF_per= 287.89 GF_per= 281.87
```
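The stuck clock is easy to spot programmatically when scanning logs like the one above. This is a minimal sketch; the parsing pattern is inferred from the log lines above, and the helper name and 1000 MHz threshold are my own illustrative choices:

```python
import re

# Extract "Clock: <n> MHz" from the HPL warning lines and flag GPUs that are
# reporting clocks far below where they should be under load.

CLOCK_RE = re.compile(r"GPU (\S+) Clock: (\d+) MHz")

def throttled_gpus(log: str, threshold_mhz: int = 1000) -> list:
    """Return (pci_address, clock_mhz) pairs reporting below threshold."""
    hits = []
    for addr, mhz in CLOCK_RE.findall(log):
        if int(mhz) < threshold_mhz:
            hits.append((addr, int(mhz)))
    return hits

sample = (
    "!!! WARNING: Rank: 1 : trp64 : GPU 0000:61:00.0 Clock: 626 MHz "
    "Temp: 49 C Power: 70 W PCIe gen 4 x16"
)
print(throttled_gpus(sample))  # -> [('0000:61:00.0', 626)]
```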
Happy computing! –dbk @dbkinghorn