I started working on gpu memory utils.
My first need is to make sure I have 8GB free GPU RAM inside fastai_docs/docs_src/run_tests.sh
, since it will fail with less than that, and I don’t want to waste time/resources, and want the script to tell me if it can tell from the get going it’s not going to succeed.
Should these go into fastai/utils/mem.py
?
These functions I wrote so far on purpose don’t tap into pytorch’s memory maps, because I need those for new processes, so if there is a cached memory somewhere by an idle process it’s not going to work. I need to know the exact available memory not used by pytorch at all.
It’s first draft, so your input on naming, and in/out args are very welcome.
We will probably have a different set of util functions that will measure the memory of the currently running process via pytorch. So those 2 sets should have a distinct naming.
from enum import IntEnum
Memory = IntEnum('Memory', "USED, FREE, TOTAL", start=0)
# returns a list of mem available for each cpu
# [ [used-0, free-0, total-0], [used-1, free-1, total-1] ]
# this function assumes nvidia-smi works and will return [] if this is not the case
def get_gpu_mem():
"query nvidia-smi for used, free and total memory for each available gpu"
import subprocess
mem = []
try:
cmd = "nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv,nounits,noheader"
result = subprocess.run(cmd.split(), shell=False, check=False, stdout=subprocess.PIPE)
except: pass
else:
if result.returncode == 0 and result.stdout:
output = result.stdout.decode('utf-8')
mem = [[int(y) for y in x.split(', ')] for x in output.strip().split('\n') ]
#print(mem)
return mem
# return the gpu number that has the most memory, and the free memory
# return [] if no gpus were found
def get_gpu_with_max_free_mem():
mem = np.array(get_gpu_mem())
if not len(mem): return []
id = np.argmax(mem[:,Memory.FREE])
return (id, mem[id,Memory.FREE])
I temporarily put them inside collect_env.py
, so a test run on a single gpu box gives:
python -c "import fastai; print(fastai.utils.collect_env.get_gpu_mem(), fastai.utils.collect_env.get_gpu_with_max_free_mem())"
[[495, 7624, 8119]] (0, 7624)