I integrated your suggestions and some from the article you linked. Thank you, @suvash.
@stas I'm not very sure if this is a good suggestion: I use a "hacky" script that sends me a notification via the TelePyth bot whenever my GPU load is < 10%, so that I can set up another training task in case a long run finished.
Thanks, @init_27. I am thinking more about various programmatic solutions to identifying GPU states and needs. What users can do with that information once it's acquired is vast, so the latter probably belongs in the forums. I'm sure some people will find your script useful, so please don't hesitate to share.
Got it.
Thanks @stas, here's the code dump. I keep it running in a Jupyter notebook in another tab; I could write a bash script, but this is the lazy option:
import time

def get_usage():
    # query the current GPU utilization; the first line is the csv
    # header, the second is the value, e.g. "42 %"
    a = !nvidia-smi --query-gpu=utilization.gpu --format=csv
    return int(a[1].replace("%", ""))

def notify():
    # telepyth pings me on Telegram; keep the token private
    !telepyth -t <token_here> "GPU IDLE!"

while True:
    if get_usage() < 10:
        # wait 10 minutes and re-check, so brief dips between
        # batches don't trigger a false alarm
        time.sleep(600)
        if get_usage() < 10:
            notify()
    time.sleep(60)  # poll once a minute instead of spinning
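For completeness, the non-lazy option would look roughly like this as a standalone bash script. This is only a sketch, untested; it assumes the telepyth CLI is on PATH and reads the token from a hypothetical TELEPYTH_TOKEN environment variable:

#!/usr/bin/env bash
# Same idea as the notebook loop: ping Telegram via telepyth when the
# GPU has stayed under 10% utilization for 10 minutes.
set -euo pipefail

gpu_util() {
    # noheader/nounits leaves just the bare number, e.g. "3"
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1
}

while true; do
    if (( $(gpu_util) < 10 )); then
        sleep 600   # wait 10 minutes, then re-check before notifying
        if (( $(gpu_util) < 10 )); then
            telepyth -t "$TELEPYTH_TOKEN" "GPU IDLE!"
        fi
    fi
    sleep 60        # poll once a minute
done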
@stas I've realised that the only things I really watch are the GPU (core) usage % and the GPU memory usage %. I have a little nvmon script on my path, because I can't remember/type out that command at all.
This could be a helpful little script (which in turn creates the nvmon script) to sneak into the image creation or instance/OS bootstrapping process.
#!/usr/bin/env bash
set -euo pipefail
# Install a tiny `nvmon` wrapper so the long nvidia-smi invocation
# doesn't have to be remembered or retyped.
NVIDIA_MONITOR_SCRIPT="/usr/local/bin/nvmon"
echo "Writing the nvmon script at $NVIDIA_MONITOR_SCRIPT"
cat <<'EOF' | sudo tee "$NVIDIA_MONITOR_SCRIPT"
#!/usr/bin/env bash
nvidia-smi --query-gpu=pstate,utilization.gpu,utilization.memory --format=csv -l 1
EOF
sudo chmod +x "$NVIDIA_MONITOR_SCRIPT"
echo "$NVIDIA_MONITOR_SCRIPT is now copied in place"
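Once that bootstrap has run, monitoring is a single command:

nvmon

which streams the performance state, GPU utilization % and memory utilization % as one CSV line per second (Ctrl-C to stop).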
Me too - which is exactly what nvidia-smi dmon does, isn't it?
Yep, and a couple more things.
But I'm sure once my brain is trained to look at the sm and mem columns, I can probably ignore everything else. Maybe I shouldn't fight dmon and just learn where to look.
I was just checking man nvidia-smi and realised that I could just select the (u)tilization group to be monitored. I'll try to remember that. No need for more wacky scripts then:

nvidia-smi dmon -s u
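For anyone else trying this: dmon also takes a delay flag, so the refresh can be slowed down if one line per second is too chatty (from the same man page; worth double-checking on your driver version):

# show only the (u)tilization group of columns: sm, mem, enc, dec
# -d sets the refresh interval in seconds (the default is 1)
nvidia-smi dmon -s u -d 5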
I updated https://docs.fast.ai/dev/gpu.html with @suvash and @ecdrid's contributions - thank you.
Please share the correct link, I can't see the page: it says "Site not found".