
Quad RTX3090 GPU Power Limiting with Systemd and Nvidia-smi

Written on November 24, 2020 by Dr Donald Kinghorn

TL;DR: You can run 4 RTX3090s in a system under heavy compute load using a single PSU and without overloading your power line, as shown in a previous post. This can be done automatically at system boot time using a script with nvidia-smi commands and a startup service configured with systemd. A script and systemd unit file are provided (and explained) below.

Introduction

This is a follow-up post to "Quad RTX3090 GPU Wattage Limited "MaxQ" TensorFlow Performance". In that post I presented TensorFlow ResNet50 performance results over a range of GPU power limits. The goal was to find a power limit that gives roughly 95% of the full performance at a total system power load that is acceptable for a single PSU running on a US 110V, 15A power line. It turns out that limiting the RTX3090s to 270W or 280W does exactly that! That means it should be reasonable to set up a quad RTX3090 system for machine learning workloads. Performance was outstanding!

For the testing in the post mentioned above I used the NVIDIA System Management Interface tool, nvidia-smi, to set GPU power limits within the test scripts. This post will show you a way to have GPU power limits set automatically at boot using a simple script and a systemd unit file.

I used Ubuntu 20.04 Server as the OS for the performance testing and for the startup service testing in this post. However, any modern Linux distribution using systemd should work the same way.

nvidia-smi commands and a script to set a power limit on RTX30 GPUs

Here are the needed nvidia-smi commands.

Persistence Mode

"When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist." (only available on Linux) This keeps the NVIDIA kernel modules from unloading. It is designed to lower job startup latency but I believe it is a good idea to set this on boot so that your power setting don't get accidentally "un-set" from a module reload.

sudo nvidia-smi -pm 1
or
sudo nvidia-smi --persistence-mode=1

This should set all of the GPUs. You can use the "-i" flag to explicitly specify GPUs by id, for example `-i 0,1,2,3` for the first 4 GPUs in the system.
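
For example, to enable persistence mode on just the first four GPUs (the same "-i" form works with the power limit command below):

sudo nvidia-smi -i 0,1,2,3 --persistence-mode=1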

Set GPU Power Limits

Setting the GPU power limit wattage can be done as follows (setting a 280W limit on the RTX3090, which defaults to 350W, as an example),

sudo nvidia-smi -pl 280
or
sudo nvidia-smi --power-limit=280

After you have made changes you can monitor power usage during a job run with the following ("-q" query, "-d" display type, "-l 1" loop every 1 second),

nvidia-smi -q -d POWER -l 1 | grep "Power Draw"
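
nvidia-smi also has a query interface that gives a more compact per-GPU readout. As an equivalent convenience (not from the original post), this prints the index, current power draw, and power limit for each GPU every second:

nvidia-smi --query-gpu=index,power.draw,power.limit --format=csv -l 1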

Please see the NVIDIA nvidia-smi documentation for details. It's a very powerful and useful tool!

We will use a systemd unit file to call a script that sets the GPU power limits at system startup. Here is a simple script to set the limits.

/usr/local/sbin/nv-power-limit.sh

#!/usr/bin/env bash

# Set power limits on all NVIDIA GPUs

# Make sure nvidia-smi exists 
command -v nvidia-smi &> /dev/null || { echo >&2 "nvidia-smi not found ... exiting."; exit 1; }

POWER_LIMIT=280

# 'Max Power Limit' is reported once per GPU, so take the lowest value
# to keep the sanity check below safe on multi-GPU systems
MAX_POWER_LIMIT=$(nvidia-smi -q -d POWER | grep 'Max Power Limit' | tr -s ' ' | cut -d ' ' -f 6 | sort -n | head -n 1)

# Compare integer wattages (strip any fractional part first)
if [[ ${POWER_LIMIT%.*} -lt ${MAX_POWER_LIMIT%.*} ]]; then
    /usr/bin/nvidia-smi --persistence-mode=1
    /usr/bin/nvidia-smi --power-limit=${POWER_LIMIT}
else
    echo >&2 'FAIL! POWER_LIMIT is not below MAX_POWER_LIMIT ... exiting.'
    exit 1
fi

exit 0

I like to use the "/usr/local" directory hierarchy for my own added system-level applications, libraries, and config files. I placed the power limit script above at /usr/local/sbin/nv-power-limit.sh. You will have to be root (use sudo) to write in that directory.
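
For example, assuming you saved the script as nv-power-limit.sh in your current directory:

sudo cp nv-power-limit.sh /usr/local/sbin/nv-power-limit.sh

File permissions are set with,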

sudo chmod 744 /usr/local/sbin/nv-power-limit.sh

root has read, write, and execute permissions; "group" and "other" have read permission only. You only want root to be able to modify or run this script!
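
With the script in place and executable, it is worth running it once by hand to confirm it works before wiring it into systemd:

sudo /usr/local/sbin/nv-power-limit.sh

nvidia-smi should report persistence mode being enabled and the new power limit for each GPU.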

Systemd unit file to start nv-power-limit.service at boot time

The following systemd unit file will be placed in /usr/local/etc/systemd. That subdirectory may not exist; you can create it (as root) with,

sudo mkdir /usr/local/etc/systemd

/usr/local/etc/systemd/nv-power-limit.service

[Unit]
Description=NVIDIA GPU Set Power Limit
After=syslog.target systemd-modules-load.service
ConditionPathExists=/usr/bin/nvidia-smi

[Service]
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
ExecStart=/usr/local/sbin/nv-power-limit.sh

[Install]
WantedBy=multi-user.target
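
A couple of notes on the unit file: "After=systemd-modules-load.service" orders the service after the NVIDIA kernel modules are loaded, and "ConditionPathExists" quietly skips the unit on a machine without nvidia-smi. Since the script just runs and exits, the default service type works; if you prefer to be explicit, an optional oneshot variant of the [Service] section (not in the unit above) would look like,

[Service]
Type=oneshot
User=root
Environment="PATH=/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
ExecStart=/usr/local/sbin/nv-power-limit.sh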

This file should have permissions set to 644, i.e. root has read and write permission, and group and other have read permission.

sudo chmod 644 /usr/local/etc/systemd/nv-power-limit.service

With the service unit file in place we need to link it into the /etc/systemd/system directory so systemd can find it.

sudo ln -s /usr/local/etc/systemd/nv-power-limit.service /etc/systemd/system/nv-power-limit.service

Do an "ls -l /etc/systemd/service" and check that you got the link right.

After the power limit script is in place and the systemd unit file linked correctly you can check that it's working properly with,

sudo systemctl start nv-power-limit.service

and

sudo systemctl status nv-power-limit.service

With the system configuration used in this post, "status" output looks like,

kinghorn@pslabs-ml1:~$ sudo systemctl status nv-power-limit.service
● nv-power-limit.service - NVIDIA GPU Set Power Limit
     Loaded: loaded (/usr/local/etc/systemd/nv-power-limit.service; linked; vendor preset: enabled)
     Active: inactive (dead)

Nov 23 16:11:25 pslabs-ml1 systemd[1]: Started NVIDIA GPU Set Power Limit.
Nov 23 16:11:27 pslabs-ml1 nv-power-limit.sh[14583]: Enabled persistence mode for GPU 00000000:53:00.0.
Nov 23 16:11:27 pslabs-ml1 nv-power-limit.sh[14583]: All done.
Nov 23 16:11:27 pslabs-ml1 nv-power-limit.sh[14587]: Power limit for GPU 00000000:53:00.0 was set to 280.00 W from 350.00 W.
Nov 23 16:11:27 pslabs-ml1 nv-power-limit.sh[14587]: All done.
Nov 23 16:11:27 pslabs-ml1 systemd[1]: nv-power-limit.service: Succeeded.

The last thing to do is to "enable" the service so that it will start at boot time.

sudo systemctl enable nv-power-limit.service

which should output,

Created symlink /etc/systemd/system/multi-user.target.wants/nv-power-limit.service → /usr/local/etc/systemd/nv-power-limit.service.
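
You can confirm the enablement state at any time with,

systemctl is-enabled nv-power-limit.service

which should print "enabled".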

When you restart your system the GPUs should be set to the power limit you configured in nv-power-limit.sh. It would be good to double-check with,

nvidia-smi -q -d POWER 
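
If you just want the configured limits at a glance, the query form is more compact (a convenience alternative to the full report above):

nvidia-smi --query-gpu=index,power.limit --format=csv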

Conclusion

That's it! You should be able to run your Quad RTX3090 ML/AI rig with a single PSU and a reasonable power load.

If you try this, be sure you understand the script and systemd unit file, and make changes as appropriate. I hope this post is helpful for everyone who wants to put 4 of these powerful RTX3090s to work. If you have suggestions on how things could be done better, please put a note in the comments!

Happy computing! --dbk @dbkinghorn



Tags: NVIDIA, TensorFlow, RTX30 series, Machine Learning
lemans24

Don, how are you doing??

I have a 1080ti and Titan Xp for running my monte carlo simulations and they work great.
Getting 2 3090 cards is crazy hard right now but it seems I can get at least a single non-blower 3090.
Do you think the performance of a single 3090 would be better than my current setup above, which I could then use mainly for development??
Once I have optimized my code again under CUDA 11, I will get a real gpu server with quad 3090 blower cards for production use later in the year/next year once availability has settled down.
I have an 850 watt PSU with an overclocked Threadripper 1950x and have no problem running the dual Pascal-based cards, which I assume consume more power than a single 3090 card.

Posted on 2021-01-30 00:07:19
Donald Kinghorn

Hey, good to hear from you :-) Still doing good here. Survived 2020 unscathed and hoping to do the same for 2021 :-)

The 3090 is a really nice card and yes, I think 1 will handily outperform both of your Pascal cards ... probably by a wide margin. Also, that 24GB mem on the card might give you some interesting possibilities.

The 3090 can draw 350W at peak load, which is crazy ... You should be OK with the 850 though.

NVIDIA is well aware of the availability problem and is seriously trying to remedy it, but 3090's will probably be hard to get for some time. We are doing pretty good getting supply, including the blower cards, but they go into new builds pretty quickly.

Having a 3090 for your CUDA 11 development work would be nice! Best wishes --Don

Posted on 2021-02-01 21:10:07
Delicious Points

Fascinating article. I myself have 9x 3090 Founders Edition cards and their power limit is 280 watts across the board. I'm gonna have to try 270 as I've read some can maintain pretty much the same performance as at 350, so this is a good start!

Posted on 2021-03-24 01:19:07
Donald Kinghorn

Yes, 270W was still in the high 90's (percent of full performance). It should also be possible to turn off the display part of the hardware, but I haven't tested the effect of doing this,

GPU Operation Mode
GOM allows to reduce power usage and optimize GPU throughput by disabling GPU features.
Each GOM is designed to meet specific user needs.
In "All On" mode everything is enabled and running at full speed.
The "Compute" mode is designed for running only compute tasks. Graphics operations are not allowed.
The "Low Double Precision" mode is designed for running graphics applications that don't require high bandwidth double precision.
GOM can be changed with the (--gom) flag.
Supported on GK110 M-class and X-class Tesla products from the Kepler family. Not supported on Quadro and Tesla C-class products. Low Double Precision and All On modes are the only modes available for supported GeForce Titan products.
Current: The GOM currently in use.
Pending: The GOM that will be used on the next reboot.

--gom=MODE
Set GPU Operation Mode: 0/ALL_ON, 1/COMPUTE, 2/LOW_DP. Supported on GK110 M-class and X-class Tesla products from the Kepler family. Not supported on Quadro and Tesla C-class products. LOW_DP and ALL_ON are the only modes supported on GeForce Titan devices. Requires administrator privileges. See GPU Operation Mode for more information about GOM. GOM changes take effect after reboot. The reboot requirement might be removed in the future. Compute only GOMs don't support WDDM (Windows Display Driver Model).

Posted on 2021-03-24 16:46:32
Jonathan Rux

For the bash script I changed the line
MAX_POWER_LIMIT=$(nvidia-smi -q -d POWER | grep 'Max Power Limit' | tr -s ' ' | cut -d ' ' -f 6)
to read
MAX_POWER_LIMIT=$(nvidia-smi -q -d POWER | grep 'Max Power Limit' | tr -s ' ' | cut -d ' ' -f 6 | cut -d '.' -f 1 | head -1)
When I ran this with 4 Quadro 8000s, it would assign the following to MAX_POWER_LIMIT:
270.00
270.00
270.00
270.00

My modification drops the decimal part and keeps only the first line, so a single number is stored in the variable.

Turns out 4 Quadros with barely enough space between them, being used to run Python, get very warm. lol

Posted on 2021-05-06 23:59:32
Donald Kinghorn

Yes, that looks good ... gotta love bash script line hacking stuff :-)

They can indeed get toasty! It's nice that it's pretty straightforward to lower the power limit to help control those temps.

It seems like newer "2 slot" cards are packed tighter than they used to be. Got to keep a good deal of overall case airflow going. One of my all-time favorite cards was a Gigabyte 2080Ti with a blower and a small air ramp angled into the rear of the card. It helped nicely with getting air to the blower and between the cards when you had 4 of them. I wish manufacturers would be more mindful of little design details like that.

Posted on 2021-05-07 03:11:12