Table of Contents
This post presents preliminary ML-AI and Scientific application performance results comparing NVIDIA RTX 4090 and RTX 3090 GPUs. These are early results using the NVIDIA CUDA 11.8 driver.
The applications tested are not yet fully optimized for compute capability 8.9 i.e. sm89, which is the compute CUDA level for the Ada Lovelace architecture. That should be fully supported with CUDA 12. The testing will be revisited after further launches of the Lovelace arch series GPUs and rebuilt applications at this higher CUDA compute capability.
The RTX 4090 does not disappoint! NVIDIA has managed, yet again, to push performance improvements to nearly double over last generation. Those days are long gone on the CPU side and may be coming to an end for NVIDIA too, given CEO Jensen Huang's comment at GTC Fall 2022, “Moore’s Law is dead.” However, I suspect that they will still manage to impress with future generations. For now, the Ada Lovelace and Hopper architectures offer tremendous performance.
AMD Threadripper Pro Test Platform
- CPU AMD Threadripper PRO 5995WX 64 Core
- CPU Cooler Noctua NH-U14S TR4-SP3
- Motherboard Asus Pro WS WRX80E-SAGE SE WIFI
- RAM 8x DDR4-3200 16GB ECC Reg. (128GB total)
- NVIDIA GeForce RTX 4090 24GB
- NVIDIA GeForce RTX 3090 24GB
- Ubuntu 22.04 Linux
- NVIDIA Enroot 3.4
Containerized Applications From NVIDIA NGC
- HPL 2.3 High Performance Linpack tag: nvcr.io/nvidia/hpc-benchmarks:21.4-hpl
- HPCG 3.1 High Performance Conjugate Gradient solver tag: nvcr.io/nvidia/hpc-benchmarks:21.4-hpcg3.1
- NAMD 3.0a11 Molecular Dynamics tag: nvcr.io/hpc/namd:3.0-alpha11
- LAMMPS Molecular Dynamics tag: nvcr.io/hpc/lammps:patch_4May2022
- TensorFlow 1.15.5 ML/AI framework tag: nvcr.io/nvidia/tensorflow:22.09-tf1-py3
- PyTorch 1.13.0a0 ML/AI framework tag: nvcr.io/nvidia/pytorch:22.09-py3
The charts below show the excellent compute performance of the RTX 4090 in comparison with the also excellent RTX 3090. Keep in mind that these are preliminary results with application that are not fully optimized for the new compute level 8.9 for the Ada Lovelace architecture.
There will be a few details about the benchmarks and command lines that were used to run them. I am including some output for only RTX 4090 job runs, however, the 4090 and 3090 were both tested in the same system back to back, i.e. all results are new testing.
This is the HPL Linpack benchmark built to run on NVIDIA GPUs. It is intended to testing on the high-end compute GPUs like the A100 and H100. It is also setup for multi-GPU multi-node use. This is the standard benchmark used for ranking the Top500 supercomputers.
It is really not intended to be run on RTX GPUs! RTX GPU have very poor double precision (fp64) performance compared to compute GPUs. The single precision (fp32) performance is however excellent on RTX. There is a fp32 version of this benchmark named HPL-AI. Unfortunately I could not get it to properly converge with 1 or 2 RTX 4090s or 3090s.
I decided to include the fp64 HPL result because even though it is nearly 10 times lower performance than an A100 it is still respectable compared to a 16-24 core CPU! I feel that developers working on code that is intended for A100 or H100 could develop and test on a RTX 4090 before moving code to the (much more expensive) A100 or H100.
See Outstanding Performance of NVIDIA A100 PCIe on HPL HPL-AI HPCG Benchmarks for comparisons,
CUDA_VISIBLE_DEVICES=0 mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 1 hpl.sh --dat ./HPL.dat --cpu-affinity 0 --cpu-cores-per-rank 4 --gpu-affinity 0
2022-10-10 19:58:13.576 ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR03L2L2 48000 288 1 1 62.41 1.183e+03 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0045393 ...... PASSED ================================================================================
HPCG (High Performance Conjugate Gradient) is a memory-bound application, typical of many engineering "solvers". It is the secondary ranking benchmark of the Top500 supercomputer list. GPU results for this benchmark are typically much higher than even high end CPU due to the high performance memory on GPUs
mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 1 hpcg.sh --dat ./hpcg.dat --cpu-affinity 0 --cpu-cores-per-rank 4 --gpu-affinity 0
SpMV = 145.6 GF ( 917.2 GB/s Effective) 145.6 GF_per ( 917.2 GB/s Effective) SymGS = 181.1 GF (1397.8 GB/s Effective) 181.1 GF_per (1397.8 GB/s Effective) total = 171.4 GF (1299.6 GB/s Effective) 171.4 GF_per (1299.6 GB/s Effective) final = 162.2 GF (1230.1 GB/s Effective) 162.2 GF_per (1230.1 GB/s Effective)
This is the GPU resident version of the molecular dynamics package NAMD 3.0 alpha 11. Performance is very good!
See [Molecular Dynamics Benchmarks GPU Roundup GROMACS NAMD2 NAMD 3alpha on 12 GPUs for comparisons with the mixed CPU-GPU version 2,14
echo "CUDASOAintegrate on" >> v3-stmv.namd namd3 +p1 +setcpuaffinity +idlepoll +devices 0 v3-stmv.namd
4090 STMV output:
Info: Benchmark time: 1 CPUs 0.0104954 s/step 8.23218 ns/day 0 MB memory TIMING: 500 CPU: 6.17736, 0.0105293/step Wall: 6.18515, 0.0104489/step, 8.2688 ns/days, 0 hours remaining, 0.000000 MB of memory in use. ENERGY: 500 357236.6193 279300.1236 81947.6998 5091.0071 -4504684.2417 381649.5571 0.0000 0.0000 947494.7573 -2451964.4774 298.0115 -3399459.2347 -2443255.3272 297.9991 1256.7666 33.2244 10200288.8725 -42.3066 -11.3685
mpirun -n 1 lmp -k on g 1 -sf kk -pk kokkos cuda/aware on neigh full comm device binsize 2.8 -var x 8 -var y 8 -var z 8 -in in.lj
4090 lj output:
Loop time of 5.94622 on 1 procs for 100 steps with 16384000 atoms Performance: 7265.125 tau/day, 16.817 timesteps/s 100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
TensorFlow 1.15.5 ResNet50
This is the NVIDIA maintained version 1 of TensorFlow which typically offers somewhat better performance than version 2. The benchmark is training 100 steps of the ResNet 50 layer convolution neural network (CNN). The result is the highest images-per-second value from the run steps. FP32 and FP16 (tensorcore) jobs were run.
CUDA_VISIBLE_DEVICES=0 python resnet.py --layers=50 --batch_size=128 --precision=fp16
PyTorch 1.13 Transformer training
This benchmark is using PyTorch 1.13 for a 6 epoch train of a Transformer model on Wikitext-2 with CUDA.
time CUDA_VISIBLE_DEVICES=0 python main.py --cuda --epochs 6 --model Transformer --lr 5 --batch_size 640
These preliminary compute results for the new NVIDIA RTX 4090 look very good and offer a significant performance boost over last generation RTX 3090 (which was already very good)! I expect results to be better with code compiled against CUDA 12 which will have full support for Ada Lovelace and Hopper arch GPUs. There will be more RTX 4000 series and Pro Axxx-ada series GPUs over the next few months. I will revisit testing.
One surprising result was how respectable the double precision floating point (fp64) was on RTX 4090. fp64 performance is not a highlighted feature of RTX GPUs. Single precision fp32 is typically 20 times fp64 for these GPUs. However, the fp64 performance of the RTX 4090 is competitive with 16-34 core CPUs. I feel this could be used for code testing and development that is target to run on high-end compute GPUs like A100 and H100.
Overall it looks like NVIDIA has kept the performance increase near doubling on successive GPU generations for another release. They have been doing this for over 10 years now. Impressive indeed!
Happy computing! –dbk @dbkinghorn
Puget Systems offers a range of powerful and reliable systems that are tailor-made for your unique workflow.
Why Choose Puget Systems?
Rather than getting a generic workstation, our systems are designed around your unique workflow and are optimized for the work you do every day.
We make sure our representatives are as accessible as possible, by phone and email. At Puget Systems, you can actually talk to a real person!
By keeping inventory of our most popular parts, and maintaining a short supply line to parts we need, we are able to offer an industry-leading ship time.