Jupyter+pytorch, or cuda memory help: stop notebook mid training

YangL · April 5, 2018, 8:28pm

So, here is what sometimes happens in Jupyter Notebook:

I make a mistake, e.g., make the epochs too long, and I want to stop my training.
I stop the offending learn.fit line.
The memory is not freed up, and every time I try to train, I get
cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCStorage.cu:58

Which is understandable.

What’s not clear is what to do next: I tried to del all relevant variables, such as md and learner, but there’s still no memory freed up.

Researching online for how to free the memory didn’t help much.

So, is my only option to restart the kernel?

Interogativ · April 5, 2018, 10:37pm

Once you run out of CUDA memory you’re hosed. You must restart the kernel. See my posts in the Part 2 - Lesson 10 In Class forum about CUDA memory.

kachio · April 6, 2018, 12:59am

Yes you’ll have to restart the kernel. You may also need to free up space on your GPU. Here’s how:

In terminal run:

nvidia-smi

This lets you see the processes running on the GPU.

To kill the process(es) type:

sudo kill -9 PID

where PID is the process id number.
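
If you would rather look up the PIDs from Python than read the nvidia-smi table by eye, here is a minimal sketch. It assumes nvidia-smi is on your PATH and uses its CSV query mode; the helper names parse_compute_apps and list_gpu_processes are made up for illustration and are not part of any library.

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows (no header, no units) into (pid, MiB) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        pid = int(parts[0])
        # Some setups report a placeholder such as [N/A] instead of a number.
        mem = int(parts[1]) if parts[1].isdigit() else None
        rows.append((pid, mem))
    return rows

def list_gpu_processes():
    """Ask nvidia-smi for the compute processes currently holding GPU memory."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling list_gpu_processes() on a GPU box should give you the same PIDs that the nvidia-smi table shows, which you can then pass to kill as above.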

jeremy · April 6, 2018, 1:22am

I think recent pytorch has a method to clear the cache. Don’t recall the name off-hand, but have a search around and let us know if you find it.

YangL · April 6, 2018, 5:11am

torch.cuda.empty_cache()

Interrupted learner.fit, and ran empty_cache().
Memory usage down from 8gb to 3gb.
learner is there, everything is there, memory is freed.
It works! I’m so happy!

… I’d like my Nobel peace prize now.
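
A likely reason usage dropped to 3gb rather than to zero: empty_cache() only hands back cached blocks that no live tensor is using, so whatever the learner and model still hold stays allocated. A minimal sketch of the cleanup step, where free_cuda_cache is a made-up helper name and the gc.collect() call is my own assumption about what helps, not something confirmed in this thread:

```python
import gc

def free_cuda_cache():
    """Collect garbage, then release PyTorch's unused cached GPU blocks.

    Returns True if the cache was emptied, False if PyTorch or CUDA
    is unavailable on this machine.
    """
    gc.collect()  # drop unreachable Python objects that may still own tensors
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # frees cached blocks only; live tensors stay put
    return True
```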

YangL · April 6, 2018, 5:18am

You know, maybe we should put this in the except block of the training code.
Free memory automatically when exception is caught.

Not necessary? Maybe?

jeremy · April 6, 2018, 6:28am

That’s a very clever idea. Want to try it and see if it works? (Should be in finally block, not exception block, perhaps?)
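
To illustrate the difference: stopping a cell in Jupyter raises KeyboardInterrupt, which an except RuntimeError block never sees, while a finally block runs either way. A minimal sketch, assuming nothing about fastai internals; run_with_cleanup is a hypothetical helper and both arguments are plain callables:

```python
def run_with_cleanup(train_fn, cleanup_fn):
    """Call train_fn(); always call cleanup_fn() afterwards.

    The finally block runs on normal completion, on a RuntimeError,
    and on the KeyboardInterrupt raised when a notebook cell is stopped.
    The original exception, if any, still propagates to the caller.
    """
    try:
        return train_fn()
    finally:
        cleanup_fn()
```

In a notebook this might be used as run_with_cleanup(lambda: learn.fit(lr, 1), torch.cuda.empty_cache), where the fit arguments are placeholders. Whether this actually recovers from an out-of-memory error is a separate question.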

narvind2003 · April 6, 2018, 5:00pm

Oooh nice!

Interogativ · April 6, 2018, 8:11pm

Free memory automatically or NOT

I had already tried this approach using this code (yes it was correctly indented):

try:
    learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))
except RuntimeError as ex:
    print("Uh, Oh!- {}".format(ex))
finally:
    torch.cuda.empty_cache()

The cache was indeed cleared, but when I ran it again after changing the BS, I got

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1512387374934/work/torch/lib/THC/generic/THCTensorMath.cu:35

Again, restarting the notebook (Kernel, Restart) seems to be the ONLY thing that “fixes” a CUDA out of memory error.

narvind2003 · April 6, 2018, 10:09pm

Oooh no!

Interogativ · April 7, 2018, 7:24pm

Why can’t I just wrap CUDA memory errors in an exception?

I’ve been digging into this for a couple of days. First I went to PyTorch’s torch.cuda source code and found this:

def synchronize(self):
    """Wait for all the kernels in this stream to complete.
    .. note:: This is a wrapper around cudaStreamSynchronize(): see
       CUDA documentation_ for more info.
    .. _CUDA documentation:
       CUDA Runtime API :: CUDA Toolkit Documentation
    """
    check_error(cudart().cudaStreamSynchronize(self))

Ah ha, so pytorch uses the “CUDA runtime API.” After signing on to the CUDA forums and searching for “memory errors”, “exceptions”, etc., I found exactly nothing relevant. Then I decided to read the runtime API notes and found this:

Context management
Context management can be done through the driver API, but is not exposed in the runtime API. Instead, the runtime API decides itself which context to use for a thread: if a context has been made current to the calling thread through the driver API, the runtime will use that, but if there is no such context, it uses a “primary context.” Primary contexts are created as needed, one per device per process, are reference-counted, and are then destroyed when there are no more references to them. Within one process, all users of the runtime API will share the primary context, unless a context has been made current to each thread. The context that the runtime uses, i.e, either the current context or primary context, can be synchronized with cudaDeviceSynchronize(), and destroyed with cudaDeviceReset().
Using the runtime API with primary contexts has its tradeoffs, however. It can cause trouble for users writing plug-ins for larger software packages, for example, because if all plug-ins run in the same process, they will all share a context but will likely have no way to communicate with each other. So, if one of them calls cudaDeviceReset() after finishing all its CUDA work, the other plug-ins will fail because the context they were using was destroyed without their knowledge. To avoid this issue, CUDA clients can use the driver API to create and set the current context, and then use the runtime API to work with it. However, contexts may consume significant resources, such as device memory, extra host threads, and performance costs of context switching on the device. This runtime-driver context sharing is important when using the driver API in conjunction with libraries built on the runtime API, such as cuBLAS or cuFFT.

My guess is that in the case of Windows you’re really using a DLL bound thru COM and in Linux, a Shared object with a wrapper. I think that Python and Pytorch never really have the level of control necessary to recover from the memory error. Once the “context” is hosed, you’re hosed until you do a cudaDeviceReset() (Kernel, Restart to us).

jeremy · April 7, 2018, 7:27pm

Yes good searching. I actually made a request on the pytorch forum related to this:

Interogativ · April 7, 2018, 9:00pm

learn.summary() for RNN_learner()

In my never ending quest to figure out a way around the CUDA out of memory problem, I’m trying to estimate how much CUDA memory will be allocated before actually allocating it. As part of this process, I need the equivalent of learn.summary() for RNN models and discovered that learn.summary() fails because it can’t find the sz property for the TextDataset.trn_clas classes. Any help figuring out the model inputs and outputs would be appreciated.
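
As a rough starting point for that estimate, the memory taken by the weights alone can be computed from the parameter shapes. This is a minimal sketch under stated assumptions: param_bytes is a hypothetical helper, float32 (4 bytes per element) is assumed, and the result is only a lower bound because it ignores activations, gradients, optimizer state and the CUDA context itself.

```python
from functools import reduce
from operator import mul

def param_bytes(shapes, bytes_per_element=4):
    """Total bytes needed to store tensors with the given shapes.

    shapes: iterable of tuples, e.g. [(400, 1150), (1150,)]
    bytes_per_element: 4 for float32 (assumed default), 2 for float16
    """
    return sum(reduce(mul, shape, 1) * bytes_per_element for shape in shapes)
```

With a PyTorch model the shapes could be collected as [tuple(p.shape) for p in model.parameters()].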

Interogativ · April 8, 2018, 9:53pm

Pytorch 0.4 has a torch.cuda.memory_allocated() function. I tried to add this to @jeremy’s learn.summary() for cnns at the beginning and end of each hook block iteration to see how much memory was added by the block and then I was going to return the cuda memory stats, along with the other summary data.
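
The before-and-after measurement described here could look roughly like the sketch below. It is not the actual learn.summary() change; track_memory is a hypothetical helper, and the probe is passed in as a plain callable so the sketch runs without a GPU. On pytorch 0.4 the probe would be torch.cuda.memory_allocated.

```python
from contextlib import contextmanager

@contextmanager
def track_memory(label, probe, log):
    """Record how much the value returned by probe() changed inside the block.

    label: name for the block being measured
    probe: zero-argument callable returning bytes currently allocated
    log:   list that receives (label, delta_bytes) tuples
    """
    before = probe()
    try:
        yield
    finally:
        # Record the delta even if the block raised, e.g. on out of memory.
        log.append((label, probe() - before))
```

Wrapping each block's forward pass in with track_memory(name, torch.cuda.memory_allocated, log): would give one delta per block, which could then be returned alongside the other summary data.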

Unfortunately the machine I was using was running Windows 10. I checked and Conda hasn’t yet got a pytorch 0.4 for Windows. I wouldn’t make a mod like this unless everybody could use it. But eventually 0.4 will support Windows and fastai will support it. I’m hesitating to make the jump to 0.4 on my Linux box because I don’t want to be incompatible with the current fastai library or the current Jupyter notebooks, but when fastai goes to pytorch 0.4 I’ll revisit this. My question for @jeremy is: when will pytorch 0.4 be supported in fastai?

jeremy · April 8, 2018, 11:57pm

I’m using fastai with 0.4 at the moment and it’s working fine FYI.

narvind2003 · April 9, 2018, 1:41am

Does that mean, Jeremy, you have a copy of the notebooks with the V functions (variable) removed? I believe they were also adding rank 0 tensors.

When will we officially switch to v0.4? As part of the MOOC?

jeremy · April 9, 2018, 3:54am

No it doesn’t - I’m using master, which seems to have an 0.4 version, but none of those changes appear. Perhaps there’s a separate branch for that?

jeremy · April 9, 2018, 3:55am

Not sure yet how to deal with that. We’ll have to see when it comes out and make a call then.

narvind2003 · April 9, 2018, 4:00am

Ok. Let me try upgrading locally to 0.4 as well.

narvind2003 · April 9, 2018, 1:14pm

I wonder if it’s the value that’s in the environment yml files.
I see this for PyTorch: