Lesson 7 in-class chat ✅

I am trying to run the notebook “Lesson 7: Super resolution on Imagenet”… however it cannot find the dataset. Is there a link from where these data can be downloaded?

Oh dear, there are 3 colour channels, I can’t believe that didn’t occur to me. Thanks Jeremy !

Hello @jeremy, I was running the lesson7-resnet-mnist notebook and noticed that in the refactor section we use fastai’s conv_layer function to construct conv blocks that consist of Conv->ReLU->BatchNorm layers. Now, in the last conv block, we go from 16 input channels to 10 output channels and then flatten it to get our array of 10 outputs. By using the conv_layer function as in the notebook, we still apply the ReLU activation to this last conv layer, and nn.CrossEntropyLoss() applies LogSoftmax on top of the ReLU. Is this a mistake or is it intentional? Or does having two non-linearities not matter?

Thanks!

I was wondering whether I can just penalize the model with some value, or whether the value should be within some range. For example, say I’m creating a cat, dog, and mouse classifier and for some reason I don’t want the model to predict cats as mice. Adding some value to the loss function every time the model predicts a cat picture to be a mouse will teach the model not to do this. How can I decide how big this value should be? For example, infinity might seem good, because then the model learns that if it thinks an image might be a cat, even at 1%, it should not predict mouse; but such a big penalty might also cause the model to never predict mouse at all. Is this a common trick for modifying a model?

I’m also interested in this because this technique could perhaps be used to balance a dataset. So if we have 10k cat images and 100 dog images, could we penalize the loss function every time the model predicts a cat image as a dog image?
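One way to picture the penalty idea from the two posts above is to keep ordinary cross-entropy and add a finite extra cost, proportional to the probability the model puts on the unwanted class. The sketch below is plain Python, not fastai or PyTorch code; the class order, the penalty matrix, the value 5.0 and all function names are made up for illustration.

```python
import math

CLASSES = ["cat", "dog", "mouse"]  # hypothetical class order

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def penalized_loss(logits, target, penalty):
    """Cross-entropy plus an expected extra cost for unwanted confusions.

    penalty[t][p] is the extra cost per unit of probability the model puts
    on class p when the true class is t (0.0 means no extra cost).
    """
    probs = softmax(logits)
    ce = -math.log(probs[target])
    extra = sum(penalty[target][p] * probs[p] for p in range(len(probs)))
    return ce + extra

no_penalty = [[0.0] * 3 for _ in range(3)]
cat_as_mouse = [
    [0.0, 0.0, 5.0],  # true cat: charge 5.0 per unit of P(mouse) (arbitrary value)
    [0.0, 0.0, 0.0],  # true dog: no extra cost
    [0.0, 0.0, 0.0],  # true mouse: no extra cost
]

logits = [1.0, 0.2, 1.5]  # the model leans towards "mouse" for a cat image
base = penalized_loss(logits, 0, no_penalty)
penalized = penalized_loss(logits, 0, cat_as_mouse)
print(f"plain CE: {base:.3f}, with penalty: {penalized:.3f}")
```

Because the extra term is finite, the loss stays finite for any prediction; how large to make it is something to tune on a validation set. For the class-imbalance case, PyTorch’s nn.CrossEntropyLoss accepts a per-class weight tensor, which is the more usual tool.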

In my opinion, ending the last layer with ReLU will give a better result than stopping at BatchNorm2d, because it will give us a larger range of values than just [-1, 1]. It is easier for the LogSoftmax to find the maximum value, right? I haven’t tested it yet, but you can experiment and share with us what you see.

In the Human Numbers notebook, where we are exploring the validation set tokens, I think:

x1[:,0] and y1[:,0]

should be

x1[0] and y1[0]

In the video for the lesson (around 1 hour, 48 minutes), Jeremy is showing the x1[:,0] notation and getting the results/output that you would expect to get from x1[0]. I don’t know why/how??

The rest of the notebook in the repo is using the x1[0] format to explore the text of the validation set.
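The two notations select the same data only if the two axes are swapped. If the batch layout differed between the video and the repo (an assumption, not something this thread confirms), that would account for what is shown in the video. A toy sketch with nested lists standing in for the tensors:

```python
# A toy batch of 2 sequences with 3 tokens each (made-up data).
seqs = [["one", "two", "three"], ["ten", "eleven", "twelve"]]

batch_first = seqs                             # layout (batch, seq_len)
seq_first = [list(col) for col in zip(*seqs)]  # layout (seq_len, batch)

first_seq_a = batch_first[0]                   # like x1[0] on a batch-first tensor
first_seq_b = [row[0] for row in seq_first]    # like x1[:,0] on a sequence-first tensor

print(first_seq_a)
print(first_seq_b)
```

Both prints show the first sequence; which notation is right depends only on which axis holds the batch.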

I tried removing the extra relu layer and the accuracy stayed pretty much the same (averaged over 3 runs). Also tried removing the BatchNorm layer at the end and that reduced accuracy slightly.

So I guess having that relu does not harm the network?

I am trying to implement the unet paper, but when concatenating the features from the contracting path with the upsampled features, I noticed that in the example in the paper, the features from the encoder are 64x64 and the upsampled features are 56x56.
My question is how do you concatenate them: do you pad the upsampled features to 64x64, or do you crop the features from the encoder to 56x56?
@lesscomfortable

You crop the features from the encoder. Quoting the paper:

Every step in the expansive path consists of an upsampling of the
feature map followed by a 2x2 convolution (“up-convolution”) that halves the
number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU.

The ‘contracting path’ is the encoder. It is called ‘contracting’ because it contracts/encodes the input to a representation of lower dimensionality.
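A minimal sketch of that crop for the 64x64 and 56x56 case from the question, assuming the crop is centered (the quoted passage only says “correspondingly cropped”). It uses nested lists instead of tensors, and both helper names are made up:

```python
def center_crop_offsets(src_size, dst_size):
    """Return (start, stop) indices that cut dst_size items from the middle of src_size."""
    if dst_size > src_size:
        raise ValueError("cannot crop to a larger size")
    start = (src_size - dst_size) // 2
    return start, start + dst_size

def center_crop_2d(grid, out_h, out_w):
    """Center-crop a 2-D grid given as a list of rows."""
    top, bottom = center_crop_offsets(len(grid), out_h)
    left, right = center_crop_offsets(len(grid[0]), out_w)
    return [row[left:right] for row in grid[top:bottom]]

# One channel of a 64x64 encoder feature map; each cell stores its own coordinates.
encoder = [[(r, c) for c in range(64)] for r in range(64)]
cropped = center_crop_2d(encoder, 56, 56)

print(len(cropped), len(cropped[0]))   # 56 56
print(cropped[0][0], cropped[-1][-1])  # (4, 4) (59, 59)
```

With tensors the same offsets would be applied to the two spatial axes before concatenating along the channel axis.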

Did you manage to solve this issue? Thanks in advance.

Hi!

I have a question. In the lesson7-resnet-mnist notebook, when we construct our first convolutional neural network, we have:

model = nn.Sequential(
    conv(1, 8), # 14
    nn.BatchNorm2d(8),
    nn.ReLU(),
    conv(8, 16), # 7
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 32), # 4
    nn.BatchNorm2d(32),
    nn.ReLU(),
    conv(32, 16), # 2
    nn.BatchNorm2d(16),
    nn.ReLU(),
    conv(16, 10), # 1
    nn.BatchNorm2d(10),
    Flatten() # remove (1,1) grid
)

In the penultimate convolutional layer, we go from 32 input channels to 16 output channels even though stride=2. That is, we reduce the number of channels before we output our labels, even as the size of the activation grid is halved. Yet in other notebooks, whenever we reduce the size of the activation grid, we tend to increase the number of channels accordingly. Given this, I would expect the last layers to go conv(32, 64) -> conv(64, 10).

Could someone explain the reasoning behind reducing the number of channels before reaching the final convolutional layer instead of continuing to double them when the size of our activation matrix is halved?

Could it be that dropping from 64 to 10 is too drastic and leads to too much information loss?
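The size comments in the model above (# 14, # 7, ...) can be reproduced with the standard convolution output-size formula. This is only a sketch: it assumes 28x28 MNIST inputs and that the notebook’s conv helper uses kernel size 3, stride 2 and padding 1, none of which is stated in this thread, and conv_out_size is a made-up helper name.

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

# (input channels, output channels) for each conv in the model above
channels = [(1, 8), (8, 16), (16, 32), (32, 16), (16, 10)]

size = 28  # assumed MNIST image side
rows = []
for n_in, n_out in channels:
    size = conv_out_size(size)
    activations = n_out * size * size
    rows.append((n_out, size, activations))
    print(f"conv({n_in}, {n_out}): grid {size}x{size}, {activations} activations")
```

With these assumptions, the activation count per image goes 1568, 784, 512, 64, 10; with conv(32, 64) the fourth value would be 256 instead of 64.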

Hello Stas,
What are the best practices for gc.collect() and learn.purge()? When do we need to run them? Are they equivalent? Could you give some code examples? Thank you.

What are the best practices for gc.collect() and learn.purge()? When do we need to run them? Are they equivalent? Could you give some code examples? Thank you.

update: See the beginning of this work in progress tutorial https://docs.fast.ai/tutorial.resources.html

the notes below have already been integrated into it.


When you are in the middle of training and want to switch gears and there is little GPU RAM left - you want to eliminate most of the objects that unnecessarily consume GPU and general RAM. So the hard way is to:

import gc

del learn
gc.collect()
learn = ... # reconstruct learn

you need to have a gc.collect() call because del learn won’t free up the object due to circular references. And while the garbage collector, which handles that, will eventually run on its own, that’s too late for us, we need the RAM now! Hence the manual call.
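The effect of a circular reference can be seen without any GPU or fastai code. In this CPython-only sketch, Node is a made-up stand-in for an object that refers back to itself, and a weak reference is used to check whether it is still alive:

```python
import gc
import weakref

class Node:
    """Tiny stand-in for an object graph that refers back to itself."""
    def __init__(self):
        self.me = self  # circular reference

gc.disable()  # keep the automatic collector out of the way for the demo
try:
    obj = Node()
    probe = weakref.ref(obj)  # does not keep obj alive

    del obj
    alive_after_del = probe() is not None      # the cycle still holds the object

    gc.collect()
    alive_after_collect = probe() is not None  # the cycle collector has freed it
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)  # True False
```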

I wrote ipyexperiments initially to do just that, and delete any other variables automatically (since then it expanded to much more functionality).

Plus, you may want to call torch.cuda.empty_cache() if you actually want to see the memory freed with nvidia-smi; this is due to pytorch caching. The memory is free, but you can’t tell from nvidia-smi (you can if you use the pytorch memory allocation functions, but that’s another function to call, yikes). ipyexperiments will do it automatically for you.

But the annoying part is that either way you have to reconstruct the Learner object. So since learn.purge() was added you don’t have to. All of the above is done with just:

learn.purge()

and it’s also done automatically when you call learn.load(). (You can override this default behavior, so that it does not purge, with the purge=False argument.)

So, to conclude, when you have finished your learn.fit() cycles and you are changing to a different image size, or you unfreeze, or you do anything else that no longer requires previous structures on GPU, you either call learn.purge() or learn.load('saved_name') and you should have most of your GPU RAM back as it was when you had just started, plus the allocated memory for the model. That is, of course, if you haven’t created some other variables that keep some GPU RAM tied up.

Therefore now you should be able to do a ton of things w/o ever needing to restart your notebook.

There will be better docs/tutorials soon, but this is probably clear enough to start using it.

Plus @sgugger has just added data.save()+learn.load_data() so you can now pre-create a bunch of data objects, save them, free the memory they use (general RAM only) and load them when you need them.

Hmm… I wouldn’t think so. Some final layers squash as many as thousands or even tens of thousands of activations down to one output variable.

Great documentation @stas! Many thanks.

My little questions:

  1. With the implementation of purge, there is no longer any need to run the following lines. True?
del learn
gc.collect()
learn = ... # reconstruct learn
  2. Instead, the best practice is to run learn.purge() before any big change like increasing image size in the databunch, unfreeze(), etc. True?

  3. Do you recommend running learn.purge() more often? For example, if I run learn.fit_one_cycle() through 10, 20, 30 epochs, is it a good practice to run learn.purge() after 10 epochs and not only after the end of my model training (i.e., 30 epochs)?

  4. When you run data.save() (in order to save the databunch), the purge option is run by default. True?

  5. In the case of learn.save()?

  6. When you run learn.save() or learn.load(), the purge option is run by default. True?

  7. What is the learn.load_data() function? You mean learn.load()?

  8. In the case of learn.export(), do you need to run learn.purge() beforehand, or is it run by default?

  9. What about load_learner()? Do you need to run learn.purge() beforehand, or is it run by default?

Other questions:

  1. I read the custom solutions about CUDA memory. Is the following equivalent to learn.purge() before running learn.fit_one_cycle()?
import traceback

class gpu_mem_restore_ctx():
    " context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)
  2. Why do you freeze before export, as written in the following code? Is it a best practice, or even an obligation, after training an unfrozen model?
# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()

Thank you for the questions!

  1. With the implementation of purge, there is no longer any need to run the following lines. True?
del learn
gc.collect()
learn = ... # reconstruct learn
  2. Instead, the best practice is to run learn.purge() before any big change like increasing image size in the databunch, unfreeze(), etc. True?

That’s the idea, yes.

  3. Do you recommend running learn.purge() more often? For example, if I run learn.fit_one_cycle() through 10, 20, 30 epochs, is it a good practice to run learn.purge() after 10 epochs and not only after the end of my model training (i.e., 30 epochs)?

No, you don’t need to inject learn.purge() between training cycles of the same setup:

learn.fit_one_cycle(10)  # 10 epochs
learn.fit_one_cycle(10)  # 10 more epochs, same setup

The subsequent invocations of the training function do not consume more GPU RAM. Remember, when you train you just change the numbers in the nodes, but all the memory that is required for those numbers has already been allocated.

  4. When you run data.save() (in order to save the databunch), the purge option is run by default. True?

This one has nothing to do with learn, you’re just saving data.

  5. In the case of learn.save()?

not at the moment - it shouldn’t by default because most likely you will want to keep the allocations for the next function, but it could be instrumented to optionally do so.

  6. When you run learn.save() or learn.load(), the purge option is run by default. True?

learn.load(), yes; wrt learn.save, see (5)

  7. What is the learn.load_data() function? You mean learn.load()?

no, it’s the counterpart of data.save.

The data save/load is brand new, fresh out of @sgugger’s mint house, and still needs to be documented.

  8. In the case of learn.export(), do you need to run learn.purge() beforehand, or is it run by default?

Good thinking - I’ve been thinking about this one too, need to discuss this one with @sgugger.

Really, what we need is learn.destroy, so that it’ll be like learn.purge but will not re-load anything and will turn learn into an empty shell. Then we won’t need to gc.collect(), as it won’t be taking up any memory.

  9. What about load_learner()? Do you need to run learn.purge() beforehand, or is it run by default?

As you can see, load_learner() returns a new one and doesn’t use the old one, so it’s really about your q8 above, i.e. how do we efficiently and concisely destroy the old learn object, since assigning to it will not do the right thing (the old object will still linger until gc.collect() arrives).

  1. I read the custom solutions about CUDA memory. Is the following equivalent to learn.purge() before running learn.fit_one_cycle() ?
import traceback

class gpu_mem_restore_ctx():
    " context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

So now you can do:

with gpu_mem_restore_ctx():
    learn.fit_one_cycle(1, 1e-2)

no, this just clears the exception object and allows for the temporary variables to be freed up. But it’s possible that there are still some nuances to work out there wrt memory reclamation - more experimentation is needed.
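To make “allows for the temporary variables to be freed up” concrete, here is a CPython-only sketch with no GPU involved. Big and failing are made-up stand-ins for a large temporary and a training call that runs out of memory; saved plays the role of whatever keeps the last exception around (an interactive session typically does).

```python
import traceback
import weakref

class Big:
    """Stand-in for a large temporary, e.g. a tensor created during training."""

def failing(probes):
    big = Big()                      # local variable of the failing frame
    probes.append(weakref.ref(big))  # lets us check later whether it is alive
    raise RuntimeError("simulated CUDA out of memory")

probes = []
try:
    failing(probes)
except RuntimeError as e:
    saved = e  # exception -> traceback -> frame -> local variable `big`

alive_before = probes[0]() is not None  # the saved traceback keeps `big` alive

traceback.clear_frames(saved.__traceback__)  # drop the frames' local variables
alive_after = probes[0]() is not None

print(alive_before, alive_after)  # True False
```

This only releases what the traceback was holding; it does not touch the Learner itself, which is why it is not a substitute for learn.purge().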

If you encounter some situations where this is not doing what you think it should, let us know. But also remember not to use nvidia-smi as a monitor, since it will not always show you the real situation. This has to do with pytorch caching: sometimes the allocator decides to free a huge chunk of memory from its cache, sometimes it holds on to it, and therefore nvidia-smi output is not a good tool in this situation. So either call torch.cuda.empty_cache() or use ipyexperiments when you experiment.

  2. Why do you freeze before export, as written in the following code? Is it a best practice, or even an obligation, after training an unfrozen model?
# end of training
learn.fit_one_cycle(epochs)
learn.freeze()
learn.export()

because I don’t know what you plan on doing with the learn object next, so that was just an example of a typical end of training with a given setup, perhaps next you will not do inference… but I’m open to suggestions to make it less confusing - perhaps just a note that this is just an example.

I’m also seeing some problems with learn.purge, so it might take a bit of time before everything we have discussed so far actually works as described. I will go write some tests to make sure that eventually it does.

[EDIT] with point 5


Thanks Stas.

Until now (and before you do more experiments), I keep in mind 5 things from your great documentation:

  1. learn.purge() removes any of the Learner guts that are no longer needed and reloads the model on GPU, which also helps to reduce memory fragmentation (copy/paste of your text).
  2. Run learn.purge() before any big change in your model training (image size, unfreeze, etc.).
  3. When you run learn.load(), learn.purge() is done by default (no need to run it).
  4. After learn.export(), it is a good practice to run learn.purge().
  5. (soon, a learn.destroy implementation) In order to reclaim GPU memory or after a “CUDA out of memory exception”, run del learn; gc.collect() or learn=None; gc.collect() (they are equivalent codes). Do not forget to reconstruct your learner after (learn = ...).

Excellent summary, @pierreguillou.

The last one we are still sorting out, I’m pitching for learn.destroy for that situation. Until then del learn; gc.collect() will give you back more memory at the moment.

In order to reclaim GPU memory or after a “CUDA out of memory exception”, is the following code equivalent to del learn; gc.collect()?

learn=None
gc.collect()
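As far as plain Python object lifetime goes, the two forms can be compared with a small CPython-only sketch. FakeLearner is a made-up class with a deliberate reference cycle; it is not the fastai Learner, and nothing here measures GPU memory.

```python
import gc
import weakref

class FakeLearner:
    """Made-up stand-in with a reference cycle, as a real learner may have."""
    def __init__(self):
        self.callbacks = [self]  # cycle: object -> list -> object

def release_with_del():
    learn = FakeLearner()
    probe = weakref.ref(learn)
    del learn            # the name is removed
    gc.collect()
    return probe() is None

def release_with_none():
    learn = FakeLearner()
    probe = weakref.ref(learn)
    learn = None         # the name is rebound
    gc.collect()
    return probe() is None

print(release_with_del(), release_with_none())  # True True
```

In both cases the object is gone after gc.collect(); the only difference is that after del the name learn no longer exists. Seeing the freed memory in nvidia-smi may still require torch.cuda.empty_cache(), as mentioned earlier in the thread.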