Thank you for the questions!
- With the implementation of purge, there is no longer any need to run the following lines. True?

```python
del learn
gc.collect()
learn = ... # reconstruct learn
```
- Instead, the best practice is to run `learn.purge()` before any big change, like increasing the image size in the databunch, `unfreeze()`, etc. True?
That’s the idea, yes.
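As background on the first question: the reason the old `del learn; gc.collect()` pattern needed an explicit `gc.collect()` is that objects caught in reference cycles are not freed by `del` alone. Here is a minimal plain-Python sketch of that effect; the `Learner`/`Callback` classes below are hypothetical stand-ins for illustration only, not fastai's actual classes:

```python
import gc
import weakref

gc.disable()  # make the demo deterministic: no automatic cycle collection

class Learner:
    """Hypothetical stand-in for a learner object."""
    def __init__(self):
        self.cb = Callback(self)  # the callback's back-reference forms a cycle

class Callback:
    """Hypothetical stand-in for a callback holding a back-reference."""
    def __init__(self, learn):
        self.learn = learn

learn = Learner()
ref = weakref.ref(learn)

del learn
assert ref() is not None  # the reference cycle keeps the object alive after del

gc.collect()              # the cycle collector breaks and frees it
assert ref() is None      # only now is the memory truly reclaimable

gc.enable()
```

Since the cycle pins the object, memory (including any GPU tensors it holds) is only released once the cycle collector runs, which is exactly what `purge` automates.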
- Do you recommend running `learn.purge()` more often? For example, if I run `learn.fit_one_cycle()` for 10, 20, 30 epochs, is it good practice to run `learn.purge()` after 10 epochs and not only at the end of my model training (i.e., 30 epochs)?
No, you don’t need to inject `learn.purge()` between training cycles of the same setup:

```python
learn.fit_one_cycle(10)
learn.fit_one_cycle(10)
```
Subsequent invocations of the training function do not consume more GPU RAM. Remember: when you train, you just change the numbers stored in the network’s nodes, and all the memory required for those numbers has already been allocated.
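The point can be illustrated with a toy in-place gradient step in plain Python (not fastai code): the parameter buffer is allocated once, and each subsequent "training cycle" only overwrites the numbers it holds.

```python
import array

# pretend these are model weights and their (fixed) gradients
params = array.array('d', [0.5, -0.3, 1.2])
grads  = array.array('d', [0.1,  0.2, -0.1])
buf_id = id(params)

lr = 0.01
for step in range(3):                 # several "training cycles"
    for i in range(len(params)):
        params[i] -= lr * grads[i]    # update the values in place

# same buffer throughout: training changed the numbers, not the allocation
assert id(params) == buf_id
```

Real optimizers work the same way at the memory level: `opt.step()` mutates existing parameter tensors rather than allocating new ones.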
- When you run `data.save()` (in order to save the databunch), the purge option is run by default. True?
This one has nothing to do with `learn`; you’re just saving `data`.
- And in the case of `learn.save()`?
Not at the moment. It shouldn’t purge by default, because most likely you will want to keep the allocations for the next function, but it could be instrumented to do so optionally.
- When you run `learn.save()` or `learn.load()`, the purge option is run by default. True?
For `learn.load()`, yes; wrt `learn.save()`, see (5) above.
- What is the `learn.load_data()` function? Or do you mean `learn.load()`?
No, that’s the counterpart of `data.save()`.
The data save/load functionality is brand new, fresh out of @sgugger’s mint, and still needs to be documented.
- In the case of `learn.export()`, do you need to run `learn.purge()` before, or is it run by default?
Good thinking. I’ve been wondering about this one too; I need to discuss it with @sgugger.
Really, what we need is `learn.destroy()`, which would be like `learn.purge()` but would not re-load anything, turning `learn` into an empty shell. Then we won’t need `gc.collect()`, as the object won’t be taking up any memory.
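A rough sketch of what such a `destroy()` could look like. This is hypothetical, not fastai’s actual API, and the attribute names are stand-ins:

```python
class Learner:
    """Hypothetical minimal learner, used only to illustrate the idea."""
    def __init__(self):
        self.model = object()   # stand-ins for the attributes that pin GPU RAM
        self.data  = object()
        self.opt   = object()

    def destroy(self):
        # drop every attribute so the instance becomes an empty shell;
        # with no references left, there is nothing for gc.collect() to chase
        for attr in list(self.__dict__):
            delattr(self, attr)

learn = Learner()
learn.destroy()
assert learn.__dict__ == {}  # empty shell: nothing left holding memory
```

The appeal of this design is that even if a stale `learn` variable lingers, the gutted instance no longer references any tensors, so no explicit collection pass is needed.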
- What about `load_learner()`? Do you need to run `learn.purge()` before, or is it run by default?
As you can see, `load_learner()` returns a new object and doesn’t use the old one, so it’s really about your q8 above, i.e. how do we efficiently and concisely destroy the old `learn` object, since assigning over it will not do the right thing (the old object will still linger until `gc.collect()` arrives).
- I read the custom solutions about CUDA memory. Is the following equivalent to `learn.purge()` before running `learn.fit_one_cycle()`?
```python
import traceback

class gpu_mem_restore_ctx():
    "context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None
```
So now you can do:

```python
with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)
```
No, this just clears the exception object and allows the temporary variables to be freed up. But it’s possible that there are still some nuances to work out wrt memory reclamation; more experimentation is needed.
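The frame-clearing effect can be seen in plain Python. In this sketch, a stand-in object plays the role of a large CUDA tensor held in the failing function’s locals; the simulated error message and class names are made up for the demo:

```python
import traceback
import weakref

class BigTensor:
    """Stand-in for a large allocation pinned by a traceback frame."""
    pass

REF = []  # holds only a weak reference, so it never pins the object itself

def boom():
    big = BigTensor()
    REF.append(weakref.ref(big))
    raise RuntimeError("CUDA out of memory (simulated)")

try:
    boom()
except RuntimeError as e:
    # the traceback keeps boom()'s frame, and hence its locals, alive
    assert REF[0]() is not None
    traceback.clear_frames(e.__traceback__)  # what gpu_mem_restore_ctx relies on
    assert REF[0]() is None                  # the local is released immediately
```

This is why clearing the frames matters: without it, every tensor that was local to the failed training call stays referenced for as long as the exception (or IPython’s stored traceback) is alive.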
If you encounter situations where this is not doing what you think it should, let us know. But also remember not to use nvidia-smi as a monitor, since it will not always show you the real situation. This has to do with PyTorch caching: sometimes the allocator decides to free a huge chunk of memory from its cache, and sometimes it holds on to it, so nvidia-smi output is not a good tool in this situation. Either call `torch.cuda.empty_cache()` or use ipyexperiments when you experiment.
- Why do you `freeze()` before `export()` as written in the following code? Is it a best practice or even an obligation after training an unfrozen model?

```python
# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()
```
Because I don’t know what you plan on doing with the `learn` object next; that was just an example of a typical end of training with a given setup. Perhaps next you will not do inference... but I’m open to suggestions to make it less confusing, perhaps just a note that this is only an example.
I’m also seeing some problems with `learn.purge`, so it might take a bit of time before everything we have discussed so far holds true. I will go write some tests to make sure that it eventually does.