Lesson 7 in-class chat ✅

Thank you for the questions!

  1. With the implementation of purge, there is no longer any need to run the following lines. True?
del learn
gc.collect()
learn = ... # reconstruct learn
  2. Instead, the best practice is to run learn.purge() before any big change, like increasing the image size in the databunch, unfreeze(), etc. True?

That’s the idea, yes.

  3. Do you recommend running learn.purge() more often? For example, if I run learn.fit_one_cycle() for 10, 20, or 30 epochs, is it good practice to run learn.purge() after 10 epochs and not only at the end of my model training (i.e., 30 epochs)?

No, you don’t need to inject learn.purge() between training cycles of the same setup:

learn.fit_one_cycle(epochs=10)
learn.fit_one_cycle(epochs=10)

Subsequent invocations of the training function do not consume more GPU RAM. Remember: when you train, you just change the numbers stored in the model's parameters, and all the memory required for those numbers has already been allocated.
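As a rough pure-Python analogy (not fastai code - the names here are illustrative only), think of the parameters as storage that is allocated once and then mutated in place by each training call:

```python
# Hypothetical sketch: repeated "training" calls mutate storage that was
# allocated once, just as fit_one_cycle updates already-allocated GPU tensors.
weights = [0.0] * 1000        # allocated once, like model parameters

def fake_fit(weights, epochs):
    # each "epoch" only changes the numbers in place; no new storage is created
    for _ in range(epochs):
        for i in range(len(weights)):
            weights[i] += 0.001

storage_id = id(weights)
fake_fit(weights, 10)
fake_fit(weights, 10)         # a second call reuses exactly the same memory
assert id(weights) == storage_id
```

The second `fake_fit` call touches the same list object, which is why no extra purge between cycles of the same setup is needed.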

  4. When you run data.save() (in order to save the databunch), the purge option runs by default. True?

This one has nothing to do with learn; you're just saving data.

  5. And in the case of learn.save()?

Not at the moment. It shouldn't run by default, because most likely you will want to keep the allocations for the next function call, but it could be instrumented to do so optionally.

  6. When you run learn.save() or learn.load(), the purge option runs by default. True?

For learn.load(), yes; for learn.save(), see (5).

  7. What is the learn.load_data() function? Do you mean learn.load()?

No, it's the counterpart of data.save().

The data save/load functionality is brand new from @sgugger and still needs to be documented.

  8. In the case of learn.export(), do you need to run learn.purge() before, or is it run by default?

Good question - I've been thinking about this one too; I need to discuss it with @sgugger.

Really, what we need is learn.destroy, so that it behaves like learn.purge but doesn't re-load anything, turning learn into an empty shell. Then we won't need gc.collect(), since the learner won't be taking up any memory.
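A minimal sketch of what such a destroy method could look like (hypothetical - this is not the actual fastai API, and the attribute names are made up for illustration):

```python
class ToyLearner:
    # stand-in for fastai's Learner; attribute names are illustrative only
    def __init__(self):
        self.model, self.data, self.opt = object(), object(), object()

    def destroy(self):
        # drop every attribute so the object becomes an empty shell and the
        # big allocations it held can be freed right away
        for attr in list(self.__dict__):
            delattr(self, attr)

learn = ToyLearner()
learn.destroy()
assert learn.__dict__ == {}   # just an empty shell now
```

The name `learn` can keep pointing at the shell without holding onto any large allocations, which is the point of the idea above.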

  9. What about load_learner()? Do you need to run learn.purge() before, or is it run by default?

As you can see, load_learner() returns a new learner and doesn't use the old one, so this is really your q8 above, i.e. how do we efficiently and concisely destroy the old learn object - since assigning to the same name will not do the right thing (the old object will still linger until gc.collect() arrives).
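Here is a pure-Python illustration of that lingering, assuming the object participates in a reference cycle (as learn does via its callbacks; `FakeLearner` and `recorder` are stand-in names):

```python
import gc
import weakref

class FakeLearner:
    def __init__(self):
        self.recorder = self   # reference cycle, like learn <-> its callbacks

gc.disable()                   # keep automatic collection from interfering
learn = FakeLearner()
probe = weakref.ref(learn)     # lets us observe when the object actually dies

learn = FakeLearner()          # rebinding the name does NOT free the old object
assert probe() is not None     # the cycle keeps the old object alive...

gc.collect()                   # ...until the cycle collector runs
assert probe() is None
gc.enable()
```

Until `gc.collect()` runs, the old object (and, in the real case, its GPU allocations) is still alive even though no name refers to it anymore.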

  10. I read the custom solutions about CUDA memory. Is the following equivalent to running learn.purge() before learn.fit_one_cycle()?
import traceback

class gpu_mem_restore_ctx():
    "context manager to reclaim GPU RAM if a CUDA out-of-memory error occurred or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)

No, this just clears the exception object and allows the temporary variables to be freed. But it's possible that there are still some nuances to work out wrt memory reclamation - more experimentation is needed.
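To see what traceback.clear_frames buys you, here is a pure-Python demonstration; a plain object stands in for a large GPU tensor, and the names are illustrative only:

```python
import traceback
import weakref

class BigObject:
    pass                        # stand-in for a large GPU tensor

probe = None

def work():
    global probe
    big = BigObject()           # held by work()'s frame locals
    probe = weakref.ref(big)    # lets us observe when it is actually freed
    raise RuntimeError("simulated CUDA OOM")

err = None
try:
    work()
except RuntimeError as e:
    err = e                     # keeping the exception keeps its traceback alive

# the traceback references work()'s frame, which still holds `big`
assert probe() is not None

traceback.clear_frames(err.__traceback__)
# clearing the frames releases the frame locals, so `big` is freed
assert probe() is None
```

This is the mechanism the context manager relies on: as long as a traceback is being kept around (as interactive environments tend to do), the frame locals - including any large tensors - stay alive until the frames are cleared.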

If you encounter situations where this is not doing what you think it should, let us know. But also remember not to use nvidia-smi as a monitor - it will not always show you the real situation. This has to do with pytorch's caching allocator: sometimes it decides to free a huge chunk of memory from its cache, and sometimes it holds onto it, so nvidia-smi's output is not a reliable tool here. Either call torch.cuda.empty_cache() or use ipyexperiments when you experiment.

  11. Why do you freeze before export, as in the following code? Is it a best practice or even an obligation after training an unfrozen model?
# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()

Because I don't know what you plan on doing with the learn object next - that was just an example of a typical end of training with a given setup; perhaps next you will not do inference. But I'm open to suggestions to make it less confusing - perhaps just a note that this is only an example.

I'm also seeing some problems with learn.purge, so it might take a bit of time for everything we have discussed so far to hold true. I will go write some tests to make sure that it eventually does.
