Lesson 6 - Official topic

This is interesting! I didn’t know this, and don’t yet know why batchnorm is implemented this way.

@DanielLam, @jcatanza, the short answer is: by unfreezing batchnorm, our model gets better accuracy.

Now the why:
When we use a pretrained model, each batchnorm layer contains the mean, the standard deviation, and the gamma and beta (the two trainable parameters) computed on the pretraining dataset (ImageNet in the case of images).

If we freeze the batchnorm layers and feed the model our data (our images), we are normalizing our batches with ImageNet's mean, standard deviation, gamma, and beta. Those values are off, especially if our images are different from the ImageNet images. Therefore our normalized activations are also off, which leads to less than optimal results.

We keep the batchnorm layers unfrozen because, while we are training the model, for each batch we calculate the mean and the standard deviation of the activations of our data (the batch of images), update (train) the corresponding gamma and beta, and use those statistics to normalize the activations of the current batch. The normalized activations are therefore much better aligned with our images (dataset) than those obtained with a frozen batchnorm.
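To make that concrete, here is a minimal PyTorch sketch (my own illustration, not the exact fastai code; the function name is mine) of what "keeping batchnorm unfrozen" amounts to: every parameter is frozen except the gamma/beta (weight/bias) of the batchnorm layers, and those layers stay in train mode so their running mean/std track our data:

import torch.nn as nn

def freeze_except_bn(model):
    # Freeze all parameters except those of batchnorm layers.
    for m in model.modules():
        is_bn = isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
        for p in m.parameters(recurse=False):
            p.requires_grad_(is_bn)  # only gamma/beta of BN stay trainable
        if is_bn:
            m.train()  # keep updating running_mean/running_var with our batches

If I remember correctly, fastai does something equivalent for you: its Learner keeps batchnorm trainable in frozen layer groups via the train_bn=True default.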


Thanks @farid, excellent explanation!


Hi All,
I am trying to use multi-label classification for the bear classification example. When I use MultiCategoryBlock, I get labels with 11 classes (corresponding to the unique letters in the class names) rather than the expected 3.

What am I missing?

Hi,

Thanks for the reply. Do you know if this has been published somewhere, or whether there is a notebook showing the differences?

I’m quoting Jeremy from lesson 12 about batchnorm:

“Anytime something weird happens to your neural net it’s almost certain it’s because of the batchnorm because batchnorm makes everything weird!”

To answer your question, you might check out the 11a_transfer_learning.ipynb notebook from lesson 12 of the Part 2 2019 course. You can also jump to the portion of the lesson 12 video where Jeremy explains the effect of the mean, the standard deviation, and the batchnorm trainable parameters on training a custom model, which is what I was referring to in my previous post.

Here is a little summary of the experiment he showed in that video:
1 - He created a custom head for his model.
2 - He froze the whole body (including the batchnorm layers) of the pretrained model.
3 - He trained his model for 3 epochs and got 54% accuracy.
4 - He unfroze the whole body and trained the model for 5 epochs, and he got 56% accuracy (which was surprisingly low).

Then he decided to unfreeze batchnorm from the beginning (meaning while training the custom head). He showed the following steps:
1 - He froze the whole body except the batchnorm layers of the pretrained model.
2 - He trained his model for 3 epochs and got 58% accuracy (already better than the above).
3 - But more importantly, when he unfroze the whole body and trained the model for 5 epochs, he got 70% accuracy (a huge jump; see the sketch below).
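In fastai2-style code, the winning recipe looks something like this (a sketch with illustrative names for the dataloaders and architecture, not Jeremy's actual lesson 12 notebook, which builds the learner by hand):

learn = cnn_learner(dls, resnet34, metrics=accuracy)  # train_bn=True by default
learn.freeze()            # body frozen, but the batchnorm layers still train
learn.fit_one_cycle(3)    # train the head (plus BN); ~58% in Jeremy's experiment
learn.unfreeze()          # make the whole body trainable
learn.fit_one_cycle(5)    # fine-tune everything; ~70% in Jeremy's experiment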


Interesting. Thank you for the response.


I believe that yes, this is a binary classification task. You could gather pictures of people wearing eyeglasses and people not wearing eyeglasses; using one of the ImageNet-pretrained models, you should get great results.
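A rough sketch of how little code that takes in fastai2 (the folder name and sizes are made up; this assumes your images are sorted into one subfolder per class):

from fastai2.vision.all import *

path = Path('eyeglasses_data')  # hypothetical: 'glasses' and 'no_glasses' subfolders
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(3)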

I am still struggling with this. Given that the above approach failed, I tried renaming all the files so that I could use the RegexLabeller approach to get the class names. This didn't work either.

I then tried re-running @muellerzr's notebook.

But I get the same error when I try to create the dataloaders. Is this an issue with Paperspace?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 dls = pets_multi.dataloaders(untar_data(URLs.PETS)/"images", bs=32)

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
     96
     97     def dataloaders(self, source, path='.', verbose=False, **kwargs):
---> 98         dsets = self.datasets(source)
     99         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    100         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/block.py in datasets(self, source, verbose)
     93         splits = (self.splitter or RandomSplitter())(items)
     94         pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
---> 95         return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
     96
     97     def dataloaders(self, source, path='.', verbose=False, **kwargs):

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
    272     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    273         super().__init__(dl_type=dl_type)
--> 274         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    275         self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
    276

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in <listcomp>(.0)
    272     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
    273         super().__init__(dl_type=dl_type)
--> 274         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
    275         self.n_inp = (1 if len(self.tls)==1 else len(self.tls)-1) if n_inp is None else n_inp
    276

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
     39             return x
     40
---> 41         res = super().__call__(*((x,) + args), **kwargs)
     42         res._newchk = 0
     43         return res

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in __init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose)
    208         if isinstance(tfms,TfmdLists): tfms = tfms.tfms
    209         if isinstance(tfms,Pipeline): do_setup=False
--> 210         self.tfms = Pipeline(tfms, split_idx=split_idx)
    211         self.types = types
    212         if do_setup:

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/transform.py in __init__(self, funcs, split_idx)
    167         else:
    168             if isinstance(funcs, Transform): funcs = [funcs]
--> 169             self.fs = L(ifnone(funcs,[noop])).map(mk_transform).sorted(key='order')
    170         for f in self.fs:
    171             name = camel2snake(type(f).__name__)

/opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/foundation.py in sorted(self, key, reverse)
    346         elif isinstance(key,int): k=itemgetter(key)
    347         else: k=key
--> 348         return self._new(sorted(self.items, key=k, reverse=reverse))
    349
    350     @classmethod

TypeError: '<' not supported between instances of 'int' and 'L'

Does anyone have a working notebook demonstrating how to do this? I have been stuck on this for a while due to an error I am getting when creating the dataloaders, as described here.

I have just tested @muellerzr's notebook, and it works fine. Your error message seems to be coming from fastcore, so most likely your fastcore and/or fastai2 installations are not up to date, or their versions are not compatible with each other. Try upgrading them, and run the notebook again.
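For example, in a notebook cell (a guess at the fix; package names as used in the course pre-release):

!pip install -U fastai2 fastcore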

Yijin

I am trying to build an image regression model (PointBlock). If I apply aug_transforms, the keypoints sometimes end up outside the actual image. Is there a way to avoid that, or to discard the augmented image when that happens?

IIRC you need to adjust the padding type and not use crop (the crop is what's causing it).
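Something like this sketch, maybe (untested; get_points and path_to_images are placeholders for your own label function and data path):

dblock = DataBlock(
    blocks=(ImageBlock, PointBlock),
    get_items=get_image_files,
    get_y=get_points,  # hypothetical: returns the keypoints for an image
    item_tfms=Resize(256, method='squish'),       # squish instead of crop
    batch_tfms=aug_transforms(pad_mode='zeros'))  # pad rather than crop when warping
dls = dblock.dataloaders(path_to_images)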


Jeremy says at the end of chapter 5, in the "Further research" section, that we should try to improve the pet breeds model's accuracy and search the forum for other students' solutions. I can't find other students' solutions or their accuracies.
Can someone help me please? :slight_smile:

Hi all,

In the pet_breeds lesson I noticed this passage about Cross-Entropy Loss, which I'm struggling to understand:

We’re only picking the loss from the column containing the correct label. We don’t need to consider the other columns, because by the definition of softmax, they add up to 1 minus the activation corresponding to the correct label. Therefore, making the activation for the correct label as high as possible must mean we’re also decreasing the activations of the remaining columns.

If we are only choosing the column with the correct label, wouldn’t we be maximising the loss and not minimising it?

For example, if we have the targets

# 6 targets
targ = tensor([0,2,2,1,4,3])

And the following softmax activations:

Indexing into the softmax activations, we have:

We picked column 0 for the first prediction, which has the highest activation. Wouldn't this maximise the loss?

I have a colab notebook for reference.

Bit confused! :sweat_smile:

Any help will be much appreciated!

Cheers,
Adi

Hi Adi!

I think if you stopped right there, then yes, it would maximize the loss. But you can just define the loss as the negative of the sm_acts values, and now you're minimizing the loss :smiley:

Also, when we go one step further and take the negative log of the softmaxed activations, a confident correct prediction gets a small loss:
-log(0.99) = 0.0101
and a wrong prediction (the correct class has a low probability) gets a high loss:
-log(0.01) = 4.6052

Remember, nll_loss does not calculate the log despite its name.
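A tiny sketch of the whole step, with made-up activations (random numbers standing in for the values in your notebook):

import torch
import torch.nn.functional as F

targ = torch.tensor([0,2,2,1,4,3])
sm_acts = torch.randn(6,5).softmax(dim=1)  # made-up 6x5 softmax activations

picked = sm_acts[range(6), targ]  # activation of the correct class for each item
loss = -picked.log()              # negative log: confident and correct -> small loss
# nll_loss expects log-probabilities and does the indexing/negation for us:
assert torch.isclose(loss.mean(), F.nll_loss(sm_acts.log(), targ))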

Cheers :slight_smile:
Hannes


Hey Hannes,

Thanks again mate :smiley:

I ended up punching -log(0.09) vs -log(0.9) in a calculator last night and it made sense.

I'm trying to get an intuition for this, and it makes sense now. Taking the negative turns maximizing the activation into minimizing the loss, and taking the log makes the function sensitive to small differences such as between 0.99 and 0.999 (which is really a 10x improvement).
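In numbers:

import math
-math.log(0.99)   # ~0.01005
-math.log(0.999)  # ~0.00100, i.e. roughly a 10x smaller loss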

I get that the step is actually log_softmax followed by nll_loss, despite the name. :crazy_face:

Thanks for jumping in and validating!

Adi

I think that sums it up nicely :smiley:

Glad I could help. Your questions always make me dig into the material again, which is great!

Hey guys, I have recently been working my way through the course, and reached the chapter on collaborative filtering. I got kind of stuck here.

Jeremy gave an example of what we are trying to achieve by fitting the latent factors using an Excel sheet.

Here, Jeremy took a batch of 15 (15 movies and 15 users) with 5 latent factors. He then calculated the predictions by taking the dot product, yielding 15x15 predictions. I am clear up to this point.

However, while defining the model from scratch:

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])    # look up each row's user embedding
        movies = self.movie_factors(x[:,1])  # look up each row's movie embedding
        return (users * movies).sum(dim=1)   # one dot product per row of the batch

we are using (users * movies).sum(dim=1), which yields a tensor of batch-size length. So for a batch of 15, it would yield a tensor of 15 predictions. Shouldn't it be 15x15, a prediction for each combination of user and movie?
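To make the shapes concrete (toy tensors standing in for the embedding lookups):

import torch

users  = torch.randn(15, 5)  # embeddings looked up for x[:,0], one row per batch item
movies = torch.randn(15, 5)  # embeddings looked up for x[:,1]
print((users * movies).sum(dim=1).shape)  # torch.Size([15]): one prediction per row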

Thanks!

Hi,
I also didn't find other students' solutions. I'm sure there is an official topic somewhere; if someone can point to the link, that would be very helpful :slight_smile:

For myself, I just tried to experiment according to Jeremy's suggestions. I did manage to improve the model, but I think it was mostly a "lucky run" given the stochastic nature of the algorithm.

I summarized my experiments in the Excel table below (and added some suggestions for further experiments based on the results). If anyone is interested, I will clean up the notebook and prepare a more detailed blog post about this task:

[image: summary table of the experiments]