I integrated your suggestions and some from the article you linked. Thank you, @suvash.
@stas I'm not very sure if this is a good suggestion: I use a "hacky" script that sends me a notification via the TelePyth bot whenever my GPU load is < 10%, so that I can set up another training task in case a long run finished.
Thanks, @init_27. I am thinking more about various programmatic solutions to identifying GPU states and needs. What users can do with that information once it's acquired is vast, so the latter probably belongs in the forums. I'm sure some people will find your script useful, so please don't hesitate to share.
Got it.
Thanks @stas, here's the code dump. I keep it running in a Jupyter notebook in another tab; I could write a bash script, but this is the lazy option:
import time

def get_usage():
    # query the current GPU utilization; the first line is the csv
    # header, the second is the value, e.g. "42 %"
    a = !nvidia-smi --query-gpu=utilization.gpu --format=csv
    return int(a[1].replace("%", ""))

def notify():
    # telepyth pings me on Telegram; keep the token private
    !telepyth -t <token_here> "GPU IDLE!"

while True:
    if get_usage() < 10:
        # wait 10 minutes and re-check, so brief dips between
        # batches don't trigger a false alarm
        time.sleep(600)
        if get_usage() < 10:
            notify()
    time.sleep(60)  # poll once a minute instead of spinning
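For completeness, the non-lazy option would look roughly like this as a standalone bash script. This is only a sketch, untested; it assumes the telepyth CLI is on PATH and reads the token from a hypothetical TELEPYTH_TOKEN environment variable:

#!/usr/bin/env bash
# Same idea as the notebook loop: ping Telegram via telepyth when the
# GPU has stayed under 10% utilization for 10 minutes.
set -euo pipefail

gpu_util() {
    # noheader/nounits leaves just the bare number, e.g. "3"
    nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1
}

while true; do
    if (( $(gpu_util) < 10 )); then
        sleep 600   # wait 10 minutes, then re-check before notifying
        if (( $(gpu_util) < 10 )); then
            telepyth -t "$TELEPYTH_TOKEN" "GPU IDLE!"
        fi
    fi
    sleep 60        # poll once a minute
done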
@stas I've realised that the only things I really watch are the GPU (core) usage % and the GPU memory usage %. I have a little nvmon script on my path, because I can't remember/type out that command at all.
This could be a helpful little script (which in turn creates the nvmon script) to sneak into the image creation or instance/OS bootstrapping process.
#!/usr/bin/env bash
set -euo pipefail
# Install a tiny `nvmon` wrapper so the long nvidia-smi invocation
# doesn't have to be remembered or retyped.
NVIDIA_MONITOR_SCRIPT="/usr/local/bin/nvmon"
echo "Writing the nvmon script at $NVIDIA_MONITOR_SCRIPT"
cat <<'EOF' | sudo tee "$NVIDIA_MONITOR_SCRIPT"
#!/usr/bin/env bash
nvidia-smi --query-gpu=pstate,utilization.gpu,utilization.memory --format=csv -l 1
EOF
sudo chmod +x "$NVIDIA_MONITOR_SCRIPT"
echo "$NVIDIA_MONITOR_SCRIPT is now copied in place"
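Once that bootstrap has run, monitoring is a single command:

nvmon

which streams the performance state, GPU utilization % and memory utilization % as one CSV line per second (Ctrl-C to stop).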
Me too - which is exactly what nvidia-smi dmon does, isn't it?
Yep, and a couple more things.
But I'm sure once my brain is trained to look at the sm and mem columns, I can probably ignore everything else. Maybe I shouldn't fight dmon and just learn where to look.
I was just checking man nvidia-smi and realised that I could just select the (u)tilization group to be monitored. I'll try to remember that. No need for more wacky scripts then:

nvidia-smi dmon -s u
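For anyone else trying this: dmon also takes a delay flag, so the refresh can be slowed down if one line per second is too chatty (from the same man page; worth double-checking on your driver version):

# show only the (u)tilization group of columns: sm, mem, enc, dec
# -d sets the refresh interval in seconds (the default is 1)
nvidia-smi dmon -s u -d 5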
I updated https://docs.fast.ai/dev/gpu.html with @suvash and @ecdrid's contributions - thank you.
Please share the correct link, I can't see the page: it says "Site not found".