Understanding GPU memory usage

OK, I spent some time retooling it:

  • updated it for the 1.0.47 API

  • dropped the main experiment, wrapped it in a function, and used only the cell logger from ipyexperiments - that gave a little more consistency, but not much

  • next I changed it not to re-use learn but to create it anew - I thought I had sorted it out, as I was getting the same bs several times in a row, but occasionally it’d still jump a lot. So re-using learn definitely adds to the inconsistency.

  • I reduced the memory allocation to just 1GB and used a smaller first bs, since the first big batch was very slow - good enough for rough testing.

  • I changed the delta to a larger one to speed up the search - it doesn’t matter to be precise at this point until we sort out a consistent output.

The latest attempt is here: https://gist.github.com/stas00/8f0b32d371a2c3ffb84c27fb44ec8688
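To make the approach above concrete, here is a minimal sketch of that kind of search loop - `create_learner`, the starting bs, and the delta are hypothetical placeholders, not the actual gist code:

```python
import torch

def find_max_bs(create_learner, start_bs=800, delta=64, min_bs=8):
    """Try progressively smaller batch sizes until one fits in GPU memory.

    `create_learner` is a hypothetical factory that builds a fresh Learner
    for the given bs - recreating it each time avoids the inconsistency
    seen when re-using the same `learn` object.
    """
    bs = start_bs
    while bs >= min_bs:
        learn = None
        try:
            learn = create_learner(bs)    # fresh learner per attempt
            learn.fit(1)                  # one epoch is enough to hit the memory peak
            return bs                     # this bs fit
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise                     # not an OOM error - re-raise it
            bs -= delta                   # shrink and retry
        finally:
            del learn
            torch.cuda.empty_cache()      # release cached blocks before the next attempt
    return None                           # nothing fit down to min_bs
```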

With try_bs=800 I was getting a consistent output quite a few times in a row.

Not sure what else to tell you - I’ve been puzzling over this for quite some time now and don’t have any answers yet.

I’d say the next stage is to switch to pure pytorch and see if you get any consistency there - doing the same thing but bypassing fastai.
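For that pure-pytorch check, a minimal sketch could look like the following - the torchvision resnet18, input size, and bs are placeholders; the point is just to repeat an identical step and see whether the reported peak moves:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def peak_mem_for_one_step(bs, device='cuda'):
    """Run one forward/backward step and return the peak allocated GPU memory in MB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    model = resnet18().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(bs, 3, 224, 224, device=device)
    y = torch.randint(0, 1000, (bs,), device=device)

    out = model(x)
    loss = nn.functional.cross_entropy(out, y)
    loss.backward()
    opt.step()

    return torch.cuda.max_memory_allocated(device) / 2**20

# same bs repeated - the numbers should be near-identical if pytorch alone is consistent
for _ in range(5):
    print(f"{peak_mem_for_one_step(64):.1f} MB")
```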

p.s. The unreliability of the peak measurement doesn’t matter here - it’s just for observation and understanding; the OOM happens whether we measure it right or not.
And even before the pure-pytorch step, I’d first try a straight gpu_mem_allocate_mbs (ipyexperiments/utils/mem.py) instead of fit and see whether that is consistent.
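That simpler check could look roughly like this - assuming gpu_mem_allocate_mbs returns the allocated tensor on success and None on failure (double-check against your ipyexperiments version):

```python
import torch
from ipyexperiments.utils.mem import gpu_mem_allocate_mbs

# try to grab the same amount of GPU RAM several times in a row -
# if raw allocation is already inconsistent, fit() is not the culprit
for i in range(5):
    buf = gpu_mem_allocate_mbs(1024)           # attempt to allocate ~1GB
    print(f"attempt {i}: {'ok' if buf is not None else 'failed'}")
    del buf                                    # free it before the next attempt
    torch.cuda.empty_cache()
```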

I’m away till Monday, so will be able to continue experimenting then.

@stas Thanks so much for all the suggestions and for going through my code. Let me take a deeper look and maybe move to pure pytorch and see how that turns out.

Is it possible to use CPU memory in addition to GPU memory?

I have 64 GB of CPU memory and 6 GB of GPU memory (VRAM). When I run resnet50, the GPU runs out of memory with bs=64. I can see the CPU and GPU being used in parallel - sometimes the CPU reaches 100% and the GPU 60% or more - but I want to let it use CPU memory in addition to GPU memory. Is that possible? If yes, how?

Thanks :slight_smile:

Technically it should be possible, but you would gain nothing from it. Depending on the application, a GPU will complete your DL task 5 to 100 times faster than the same task on a CPU, so the CPU would become a bottleneck and overall everything would be even slower.

If speed is not an issue, then just use the CPU, since you have a lot more of its memory. You will just have to wait a lot longer for training to complete.

Otherwise, lower your bs, image size, etc., or upgrade your GPU.
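If you do decide to go the CPU route, in plain PyTorch the device choice is explicit - a minimal sketch, assuming a torchvision resnet50 and placeholder batch sizes:

```python
import torch
from torchvision.models import resnet50

# fall back to the CPU when no GPU is available, or just pick a smaller bs for the 6GB card
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
bs = 32 if device.type == 'cuda' else 64

model = resnet50().to(device)                       # the model and its batches must live on one device -
x = torch.randn(bs, 3, 224, 224, device=device)     # CPU and GPU memory cannot be pooled transparently
out = model(x)
```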

I put together some code to profile memory usage on the forward and backward passes, and also added torch.utils.checkpoint support. It makes extensive use of the Hook class to access the model. It is missing register_forward_pre_hook because Hook does not have it (I could PR this).
You can check the notebook here.
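For reference, in plain PyTorch (bypassing fastai’s Hook class) the profiling side boils down to forward hooks that record the allocator state after each module runs. A rough sketch, using a torchvision resnet18 as a stand-in:

```python
import torch
from torchvision.models import resnet18

def attach_mem_hooks(model):
    """Register forward hooks that log allocated GPU memory after each leaf module."""
    mem_log = []

    def hook(module, inp, out):
        mem_log.append((module.__class__.__name__,
                        torch.cuda.memory_allocated() / 2**20))  # MBs currently allocated

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if len(list(m.children())) == 0]
    return mem_log, handles

model = resnet18().cuda()
mem_log, handles = attach_mem_hooks(model)

x = torch.randn(8, 3, 224, 224, device='cuda')
model(x).sum().backward()            # forward fills mem_log; backward frees activations

for name, mb in mem_log[:5]:
    print(f"{name:<15} {mb:8.1f} MB")

for h in handles:                    # clean up so later runs are not affected
    h.remove()
```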

It is something I don’t fully understand, but I am getting lower memory usage on my ResNet using this trick.
(Figure: ResNet18 memory usage on the forward/backward pass, normal vs sequential checkpoints.)
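And the checkpointing side of the trick, in plain PyTorch, looks roughly like this - a toy nn.Sequential stands in for the sequentialized ResNet from the notebook:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# a toy stand-in for the sequentialized ResNet body
model = nn.Sequential(*[nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
                        for _ in range(8)]).cuda()

# requires_grad=True on the input so gradients flow through the checkpointed segments
x = torch.randn(4, 16, 64, 64, device='cuda', requires_grad=True)

torch.cuda.reset_peak_memory_stats()
out = checkpoint_sequential(model, 4, x)   # recompute activations in 4 chunks during backward
out.sum().backward()
print(f"peak with checkpointing: {torch.cuda.max_memory_allocated() / 2**20:.1f} MB")
```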

Most of this comes from here.