Lesson 7 in-class chat ✅

Hello Stas,
What are the best practices for gc.collect() and learn.purge()? When do we need to run them? Are they equivalent? Could you give some code examples? Thank you.



Update: see the beginning of this work-in-progress tutorial: https://docs.fast.ai/tutorial.resources.html

The notes below have already been integrated into it.


When you are in the middle of training, want to switch gears, and there is little GPU RAM left, you want to eliminate most of the objects that unnecessarily consume GPU and general RAM. The hard way to do that is:

import gc

del learn        # drop the reference; circular refs mean this alone won't free the object
gc.collect()     # force collection now instead of waiting for the automatic gc
learn = ... # reconstruct learn

You need the gc.collect() call because del learn won't free up the object on its own due to circular references. And while the automatic collection that handles those will eventually run, that's too late for us - we need the RAM now! Hence the manual call.

I initially wrote ipyexperiments to do just that, and to delete any other variables automatically (since then it has expanded to much more functionality).

Plus, you may want to call torch.cuda.empty_cache() if you actually want to see the memory freed with nvidia-smi - this is due to PyTorch's caching allocator. The memory is free, but you can't tell from nvidia-smi (you can if you use PyTorch's memory allocation functions, but that's yet another thing to call, yikes). ipyexperiments will do it automatically for you.
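
For example, here is a quick way to see the difference between what PyTorch has actually allocated and what it merely caches (a sketch using standard PyTorch calls; the exact numbers will vary):

import torch

print(torch.cuda.memory_allocated())  # bytes used by live tensors
print(torch.cuda.memory_cached())     # bytes held by the caching allocator (renamed memory_reserved() in newer PyTorch)

torch.cuda.empty_cache()              # hand cached-but-unused blocks back to the driver
print(torch.cuda.memory_cached())     # now much closer to memory_allocated(), and nvidia-smi drops accordingly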

The annoying part is that either way you have to reconstruct the Learner object. Since learn.purge() was added, you no longer have to. All of the above is done with just:

learn.purge()

It is also done automatically when you call learn.load(). You can override that default behavior by passing the purge=False argument.

So, to conclude: when you have finished your learn.fit() cycles and you are changing to a different image size, unfreezing, or doing anything else that no longer requires the previous structures on the GPU, you either call learn.purge() or learn.load('saved_name'), and you should get most of your GPU RAM back, as it was when you had just started, plus the memory allocated for the model. That is, of course, provided you haven't created other variables that keep some GPU RAM tied up.
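
For instance, a rough sketch of that flow (the learner constructor and the data_128/data_256 DataBunch variables here are just placeholders):

learn = cnn_learner(data_128, models.resnet34)
learn.fit_one_cycle(4)
learn.save('stage-1-128')

# switching gears: larger images and unfreezing
learn.purge()              # or: learn.load('stage-1-128'), which purges by default
learn.data = data_256      # swap in the DataBunch built with the larger images
learn.unfreeze()
learn.fit_one_cycle(4, slice(1e-5, 1e-4))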

So now you should be able to do a ton of things without ever needing to restart your notebook.

There will be better docs/tutorials soon, but this is probably clear enough to start using it.

Plus, @sgugger has just added data.save() + learn.load_data(), so you can now pre-create a bunch of data objects, save them, free the memory they use (general RAM only), and load them when you need them.
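
A rough sketch of how that workflow might look; get_data() is a placeholder for however you build your DataBunch, learn is an existing Learner, and the exact signatures of data.save() and learn.load_data() are my assumptions from the names above, so check the docs once they are written:

import gc

data_small = get_data(bs=64, size=128)   # pre-create the DataBunch objects you will need later
data_large = get_data(bs=32, size=256)
data_small.save('data_128.pkl')          # serialize each DataBunch to disk
data_large.save('data_256.pkl')
del data_small, data_large               # free the general RAM they were holding
gc.collect()

learn.load_data('data_128.pkl')          # assumed usage: load a saved DataBunch back into the Learner when needed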


Hmm… I wouldn’t think so. Some final layers squash as many as thousands or even tens of thousands of activations down to one output variable.

Great documentation @stas! Many thanks.

My little questions:

  1. With the implementation of purge, there is no need anymore to run the following lines. True?

del learn
gc.collect()
learn = ... # reconstruct learn

  2. Instead, the best practice is to run learn.purge() before any big change like increasing the image size in the databunch, unfreeze(), etc. True?

  3. Do you recommend running learn.purge() more often? For example, if I run learn.fit_one_cycle() for 10, 20, 30 epochs, is it good practice to run learn.purge() after 10 epochs and not only at the end of my model training (i.e., 30 epochs)?

  4. When you run data.save() (in order to save the databunch), the purge option is run by default. True?

  5. In the case of learn.save()?

  6. When you run learn.save() or learn.load(), the purge option is run by default. True?

  7. What is the learn.load_data() function? You mean learn.load()?

  8. In the case of learn.export(), do you need to run learn.purge() before, or is it run by default?

  9. What about load_learner()? Do you need to run learn.purge() before, or is it run by default?

Other questions:

  1. I read the custom solutions about CUDA memory. Is the following equivalent to learn.purge() before running learn.fit_one_cycle()?

import traceback

class gpu_mem_restore_ctx():
    "context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)

  2. Why do you freeze before export, as written in the following code? Is it a best practice, or even an obligation, after training an unfrozen model?

# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()

Thank you for the questions!

  1. With the implementation of purge, there is no need anymore to run the following lines. True?

del learn
gc.collect()
learn = ... # reconstruct learn

  2. Instead, the best practice is to run learn.purge() before any big change like increasing the image size in the databunch, unfreeze(), etc. True?

That’s the idea, yes.

  3. Do you recommend running learn.purge() more often? For example, if I run learn.fit_one_cycle() for 10, 20, 30 epochs, is it good practice to run learn.purge() after 10 epochs and not only at the end of my model training (i.e., 30 epochs)?

No, you don’t need to inject learn.purge() between training cycles of the same setup:

learn.fit_one_cycle(10)
learn.fit_one_cycle(10)

Subsequent invocations of the training function do not consume more GPU RAM. Remember, when you train you just change the numbers in the nodes; all the memory required for those numbers has already been allocated.
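
If you want to convince yourself, here is a quick sanity check (just a sketch; the absolute numbers depend on your model):

import torch

learn.fit_one_cycle(10)
print(torch.cuda.memory_allocated())   # note the value

learn.fit_one_cycle(10)
print(torch.cuda.memory_allocated())   # roughly the same: no new permanent allocations from the second cycle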

  4. When you run data.save() (in order to save the databunch), the purge option is run by default. True?

This one has nothing to do with learn; you're just saving data.

  5. In the case of learn.save()?

Not at the moment. It shouldn't purge by default, because most likely you will want to keep the allocations for the next function, but it could be instrumented to optionally do so.
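
If you do want the memory back right after saving, you can of course just chain the two calls yourself:

learn.save('stage-1')
learn.purge()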

  6. When you run learn.save() or learn.load(), the purge option is run by default. True?

For learn.load(), yes; wrt learn.save(), see (5).

  7. What is the learn.load_data() function? You mean learn.load()?

No, that's the counterpart of data.save().

The data save/load functionality is brand new out of @sgugger's mint and still needs to be documented.

  8. In the case of learn.export(), do you need to run learn.purge() before, or is it run by default?

Good thinking - I've been wondering about this one too; I need to discuss it with @sgugger.

Really, what we need is learn.destroy, which would be like learn.purge but would not re-load anything, turning learn into an empty shell. Then we won't need gc.collect() either, as it won't be taking up any memory.

  9. What about load_learner()? Do you need to run learn.purge() before, or is it run by default?

As you can see, load_learner() returns a new Learner and doesn't use the old one, so it's really about your q8 above, i.e. how do we efficiently and concisely destroy the old learn object, since simply assigning over it will not do the right thing (the old object will still linger until gc.collect() arrives).
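
In other words, something along these lines rather than plain reassignment (a sketch; path here is wherever you exported the Learner to):

import gc

del learn                     # explicitly drop the old Learner...
gc.collect()                  # ...and collect it now instead of waiting for the automatic gc
learn = load_learner(path)    # then create the new one for inference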

  1. I read the custom solutions about CUDA memory. Is the following equivalent to learn.purge() before running learn.fit_one_cycle()?

import traceback

class gpu_mem_restore_ctx():
    "context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)

No, this just clears the exception object and allows the temporary variables to be freed. But it's possible that there are still some nuances to work out there wrt memory reclamation - more experimentation is needed.

If you encounter situations where this is not doing what you think it should, let us know. But also remember not to use nvidia-smi as a monitor, since it will not always show you the real situation - this has to do with PyTorch's caching: sometimes the allocator decides to free a huge chunk of memory from its cache, sometimes it holds on to it, and therefore nvidia-smi output is not a good tool here. So either call torch.cuda.empty_cache() or use ipyexperiments when you experiment.

  2. Why do you freeze before export, as written in the following code? Is it a best practice, or even an obligation, after training an unfrozen model?

# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()

Because I don't know what you plan on doing with the learn object next - that was just an example of a typical end of training with a given setup; perhaps next you will not do inference at all... But I'm open to suggestions to make it less confusing - perhaps just a note that this is only an example.

I'm also seeing some problems with learn.purge, so it might take a bit of time before everything we have discussed so far works as described. I will go write some tests to make sure that it eventually does.


[EDIT] updated with point 5


Thanks Stas.

For now (and until you do more experiments), I'm keeping in mind 5 things from your great documentation:

  1. learn.purge() removes any of the Learner guts that are no longer needed and reloads the model on GPU, which also helps to reduce memory fragmentation (copy/paste of your text).
  2. Run learn.purge() before any big change in your model training (image size, unfreeze, etc.).
  3. When you run learn.load(), learn.purge() is done by default (no need to run it).
  4. After learn.export(), it is a good practice to run learn.purge().
  5. (Soon: a learn.destroy implementation.) In order to reclaim GPU memory, or after a "CUDA out of memory" exception, run del learn; gc.collect() or learn=None; gc.collect() (they are equivalent). Do not forget to reconstruct your learner afterwards (learn = ...).

Excellent summary, @pierreguillou.

The last one we are still sorting out; I'm pitching for learn.destroy for that situation. Until then, del learn; gc.collect() will give you back the most memory.


In order to reclaim GPU memory, or after a "CUDA out of memory" exception, is the following code equivalent?

learn=None
gc.collect()

Correct, and in both cases that is only assuming there are no other variables that refer to learn.
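
That caveat matters because Python only frees an object once nothing references it anymore, for example:

import gc

backup = learn    # a second name bound to the same Learner object
learn = None
gc.collect()      # frees nothing: backup still keeps the Learner (and its GPU tensors) alive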

learn.destroy is almost ready; you can try:

def destroy(self):
    "Free the Learner internals, leaving just an empty shell that consumes no memory"
    attrs = [k for k in self.__dict__.keys() if not k.startswith("__")]
    for a in attrs: delattr(self, a)
    gc.collect()

but it’s not @sgugger approved yet.

You just call:

learn.destroy()

No need for del, assigning None, or gc.collect().

I’ve updated my summary with a fifth point about learn=None; gc.collect().

About learn.destroy: is it equivalent to learn=None; gc.collect()?

Pretty much. It leaves a hollow learn shell, which takes close to zero memory, so even if you don't reassign to it later it doesn't matter. destroy() pretty much resets it to {}, but keeps its methods, which I guess could be deleted too - most likely that's what should be done, otherwise it'd be misleading: the methods would still be intact but there would be no internal data to work with. So it will most likely be slightly different in the final version.
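
The "keeps its methods" part is just how Python works: methods live on the class, while only the data lives in the instance's __dict__. A tiny illustration, unrelated to fastai itself:

class Widget:
    def __init__(self): self.payload = list(range(10_000))    # instance data
    def describe(self): return f"{len(self.__dict__)} attributes left"

w = Widget()
w.__dict__.clear()    # wipe the instance data - roughly what destroy() does to learn
print(w.describe())   # prints "0 attributes left": the method still resolves via the class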


I agree, this needs to be fixed.

The in-class "Human Numbers" example is very helpful for understanding how to encode and operate on language-like vocabularies.
However, I am having difficulty envisioning how to apply fastai's RNN structure to a continuous variable. For instance, imagine wanting to predict a store's sales from its historical performance (à la the Rossmann problem from Lesson 6), but using an RNN instead of tabular data.

Does anyone have a worked example (ideally on Kaggle) that I could walk through to get a better understanding of RNNs on a continuous variable?

Thanks!


In the "superres-gan" example, I am unable to train the discriminator/generator pair using learn.fit(40, lr). The pretraining of both networks worked fine, but when I try to fit the pretrained model I get a nonspecific error. I've copied the full stack trace below.

AttributeError                            Traceback (most recent call last)
<ipython-input-27-d44c81445766> in <module>
----> 1 learn.fit(40,lr)

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    188         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
    189         fit(epochs, self.model, self.loss_func, opt=self.opt, data=self.data, metrics=self.metrics,
--> 190             callbacks=self.callbacks+callbacks)
    191 
    192     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

/opt/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, model, loss_func, opt, data, callbacks, metrics)
     90             cb_handler.on_epoch_begin()
     91             for xb,yb in progress_bar(data.train_dl, parent=pbar):
---> 92                 xb, yb = cb_handler.on_batch_begin(xb, yb)
     93                 loss = loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     94                 if cb_handler.on_batch_end(loss): break

/opt/anaconda3/lib/python3.7/site-packages/fastai/callback.py in on_batch_begin(self, xb, yb, train)
    253         self.state_dict['train'],self.state_dict['stop_epoch'] = train,False
    254         self.state_dict['skip_step'],self.state_dict['skip_zero'] = False,False
--> 255         self('batch_begin', mets = not self.state_dict['train'])
    256         return self.state_dict['last_input'], self.state_dict['last_target']
    257 

/opt/anaconda3/lib/python3.7/site-packages/fastai/callback.py in __call__(self, cb_name, call_mets, **kwargs)
    224         if call_mets:
    225             for met in self.metrics: self._call_and_update(met, cb_name, **kwargs)
--> 226         for cb in self.callbacks: self._call_and_update(cb, cb_name, **kwargs)
    227 
    228     def set_dl(self, dl:DataLoader):

/opt/anaconda3/lib/python3.7/site-packages/fastai/callback.py in _call_and_update(self, cb, cb_name, **kwargs)
    215         "Call `cb_name` on `cb` and update the inner state."
    216         new = ifnone(getattr(cb, f'on_{cb_name}')(**self.state_dict, **kwargs), dict())
--> 217         for k,v in new.items():
    218             if k not in self.state_dict:
    219                 raise Exception(f"{k} isn't a valid key in the state of the callbacks.")

AttributeError: 'tuple' object has no attribute 'items'

This seems to be an issue with training GANs in general; I encounter the same error when attempting to fit the learner in the "wgan" example.

When attempting to train the “2a” model in the “superres” example, I’m getting a CUDA out of memory error. I tried halving the batch size, from 32 down to 16, but am still getting the error.

Strangely, the error only shows up on the second epoch of training (it gets through the first epoch just fine, and prints the results), which makes me think that it is caching data between loops that it shouldn't be. The error message says 1.49 GiB cached, which seems like a lot…

Has anyone else encountered a similar issue and figured out how to fix it?

Have you tried decreasing your image size? GANs in general eat up a ton of GPU memory, and while I didn't have issues running the notebooks, I did have issues using GANs in a personal project I'm currently working on (based on the superres notebook). Playing with both batch size and image resolution helped me find the "sweet spot" for training where I don't get OOM errors; see the sketch below.
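
Concretely, something like this (get_data() here stands in for however the notebook builds its databunch, and the numbers are just starting points to tune):

data = get_data(bs=8, size=96)   # smaller batches and smaller crops until training fits in GPU RAM
learn.data = data
learn.fit(40, lr)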

What is the correct order for BatchNorm and ReLU layers?

In "lesson7-resnet-mnist.ipynb", the "Basic CNN with batchnorm" section uses the order Conv2d -> BatchNorm -> ReLU,
but the fastai conv_layer helper uses the order Conv2d -> ReLU -> BatchNorm (both written out as code below).
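
For reference, here are the two orderings I'm comparing, written out as plain PyTorch (just an illustration, not the actual notebook or fastai source):

import torch.nn as nn

# order in the notebook's "Basic CNN with batchnorm" section
conv_bn_relu = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# order I see from fastai's conv_layer helper
conv_relu_bn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(8),
)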

Please help.


I'm trying to run the GAN example with a different dataset. However, I get this output:

What does this mean?

Hello! I'm a little confused about kernels, filters, and convolutions. I understand that a black-and-white digit image in the MNIST dataset has size 1x28x28, where the 1 is the number of channels, because the image is grayscale; RGB images have 3 channels (red, green, blue).
But I can't understand the part around 13:34 of lesson 7, where we build the model and say that we have 8 channels, and that we picked that number just because we wanted 8. What do we mean by this? Are these 8 filters? Do they help us predict the final outcome? And how are they combined with the single grayscale channel? Is this the same process we use with kernels? I would appreciate any help, thank you!
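
For concreteness, here is roughly the layer I'm asking about, written in plain PyTorch (an illustration, not the lesson's exact code):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)    # one grayscale MNIST image: batch of 1, 1 channel, 28x28

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=2, padding=1)

print(conv.weight.shape)   # torch.Size([8, 1, 3, 3]): 8 filters, each spanning the 1 input channel
print(conv(x).shape)       # torch.Size([1, 8, 14, 14]): 8 output channels, i.e. 8 feature maps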