Show_install(0) Cuda Issues

I like having the formatting. I just would maybe add ```text to the starting one since it isn’t really code and that formats it better, but just having that information in general is awesome.

yes, sorry, I only saw what you meant when I pasted it myself :wink: I added ```text. Thank you for a great suggestion, @KevinB. Let’s hope people will copy-n-paste it too :wink:

1 Like

Probably better to use:

nvidia-smi dmon

I didn’t know of that one, thank you. Excellent for watching memory consumption!

But you can’t see the processes there. So I suppose both are useful.

I made a summary of this discussion here: http://docs-dev.fast.ai/troubleshoot#am-i-using-my-gpus

4 Likes

this might be useful too.

6 Likes

This is great, Suvash! Keep those suggestions coming, I will be compiling them together at http://docs-dev.fast.ai/

I started a new document https://docs-dev.fast.ai/gpu.html to collect gpu-related tips, so if you have other suggestions please send them my way. Thanks.

3 Likes

adding the utilization metrics (utilization.gpu,utilization.memory) is also a good idea.

nvidia-smi --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 5

and you can get more info by

nvidia-smi --help-query-gpu

More information available here.
https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries
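
If you'd rather consume that CSV from Python than watch it scroll by, here is a minimal sketch. It assumes you run the query once (no `-l 5`), and `parse_gpu_csv` and `query_gpus` are names I made up for illustration, not part of any library. The dictionary keys are simply whatever header line nvidia-smi prints on your driver version.

```python
import csv
import io
import subprocess

# Field list copied from the command above.
FIELDS = ("timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,"
          "memory.total,memory.free,memory.used")


def parse_gpu_csv(text):
    """Turn `nvidia-smi --format=csv` output into a list of dicts, one per GPU.

    Hypothetical helper: keys come from the header row, values stay strings.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def query_gpus():
    """Run nvidia-smi once and parse the result (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` on a machine with a GPU should give one dict per card; the parsing part can be tried on any saved output without a GPU.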

1 Like

I integrated your suggestions and some from the article you linked. Thank you, @suvash.

1 Like

@stas I’m not very sure if this is a good suggestion: I use a “hacky” script that sends me a notification via the telepyth bot whenever my GPU load is < 10%, so that I can set up another training task in case a long run has finished.

Thanks, @init_27. I am thinking more about various programmatic solutions for identifying GPU states and needs. What users can do with that information once it’s acquired is vast, so that part probably belongs in the forums. I’m sure some people will find your script useful, so please don’t hesitate to share it.

Got it.

Thanks @stas, here’s the code dump. I keep it running in a Jupyter notebook in another tab. I could write a bash script, but this is the lazy option:

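import time

def get_usage():
    # IPython "!" shell escape: returns the output lines, a[0] is the CSV header
    a = !nvidia-smi --query-gpu=utilization.gpu --format=csv
    # a[1] is the first GPU's reading, e.g. "45 %"
    return int(a[1].replace("%", ""))


def notify():
    !telepyth -t <token_here> "GPU IDLE!"


while True:
    if get_usage() < 10:
        time.sleep(600)
        # still idle ten minutes later, so the run has probably finished
        if get_usage() < 10:
            notify()
    else:
        time.sleep(60)  # don't hammer nvidia-smi while the GPU is busy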
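
For anyone who wants the same idea outside a notebook, here is a rough sketch using `subprocess` instead of the IPython `!` escape. The function names and the `poll_every` parameter are my own assumptions, and unlike the snippet above it looks at the busiest GPU rather than only the first one. It just returns once the GPU has been idle twice in a row, so you can plug in whatever notification you like afterwards.

```python
import subprocess
import time


def get_usage():
    """Return the utilization (%) of the busiest GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    # One bare number per GPU; taking the max is a choice made for this sketch.
    return max(int(line) for line in out.splitlines() if line.strip())


def wait_until_idle(read_usage=get_usage, threshold=10, confirm_after=600,
                    poll_every=60, sleep=time.sleep):
    """Block until usage is below `threshold` twice, `confirm_after` seconds apart."""
    while True:
        if read_usage() < threshold:
            sleep(confirm_after)
            if read_usage() < threshold:
                return
        else:
            # GPU is busy: wait a bit before asking again.
            sleep(poll_every)
```

Usage would be something like calling `wait_until_idle()` and then running the telepyth command from the post above. `read_usage` and `sleep` are parameters only so the loop can be tried without a GPU or a ten-minute wait.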

There’s a Python library also named gpustat which pretty much does the same …

3 Likes

@stas I’ve realised that the things I watch the most are the GPU (core) usage % and the GPU memory usage %. I have a little nvmon script on my path, because I can’t remember or type out that command at all.

This could be a helpful little script (which in turn creates the nvmon script) to sneak into the image creation or instance/OS bootstrapping process.

#!/usr/bin/env bash

set -euo pipefail

# Nvidia monitor script
NVIDIA_MONITOR_SCRIPT="/usr/local/bin/nvmon"

echo "Writing the nvmon script at $NVIDIA_MONITOR_SCRIPT"

# Quoted 'EOF' keeps the heredoc body literal (nothing gets expanded)
cat <<'EOF' | sudo tee "$NVIDIA_MONITOR_SCRIPT"
#!/usr/bin/env bash

nvidia-smi --query-gpu=pstate,utilization.gpu,utilization.memory --format=csv -l 1
EOF

sudo chmod +x "$NVIDIA_MONITOR_SCRIPT"
echo "$NVIDIA_MONITOR_SCRIPT is now in place"

Me too - which is exactly what nvidia-smi dmon does, isn’t it?

1 Like

yep, and a couple more things.

but I’m sure that once my brain is trained to look at the sm and mem columns, I can probably ignore everything else. Maybe I shouldn’t fight dmon and just learn where to look.

was just checking man nvidia-smi and realised that I could just (s)elect the (u)tilization group to be monitored. I’ll try to remember that. No need for more wacky scripts then: nvidia-smi dmon -s u
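
If you ever want those sm and mem columns in Python, the dmon text is easy enough to pick apart. This is only a sketch: `parse_dmon` is a made-up helper, and it assumes the layout I see on my machine, where the first `#` line names the columns and the second gives the units. The exact set of columns varies between driver versions, so it reads the names from the header instead of hard-coding them.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon` text output into one dict per sample row.

    Hypothetical helper: values are kept as strings because dmon prints "-"
    for metrics a card does not support.
    """
    columns, rows = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # First comment line = column names; later ones (units, repeated
            # headers) are skipped.
            if columns is None:
                columns = stripped.lstrip("#").split()
            continue
        if columns:
            rows.append(dict(zip(columns, stripped.split())))
    return rows
```

You could feed it the output of `nvidia-smi dmon -s u -c 1` (the `-c` flag limits the number of samples) captured with `subprocess.run`, or any saved dmon log.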

1 Like

I updated https://docs.fast.ai/dev/gpu.html with @suvash and @ecdrid’s contributions - thank you.

5 Likes

Please share the working link. I can’t see the page; it says “Site not found”.

https://docs.fast.ai/dev/gpu.html

Looks like they moved it to here.

1 Like