Lesson 1 In-Class Discussion ✅

In lesson 1 notebook, when we plot the confusion matrix:

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

Is it using a hold-out set to do this? Is this set different from the set used to train the model?

Thanks!

Which platform are you using?

Thanks! I kept looking for the method in the learn object - didn’t expect it to be a part of the img!

For ResNet, the typical normalization is mean subtraction only, with no division by the std. The per-channel mean for the ImageNet dataset is

mean= [103.939, 116.779, 123.68]

(BGR channels)
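For reference, here's a small sketch of what that Caffe-style preprocessing looks like in numpy (this is not fastai's built-in normalization, and the function and variable names are mine):

import numpy as np

# Caffe-style ResNet preprocessing as described above: reorder RGB -> BGR and
# subtract the per-channel ImageNet means, with no division by the std.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def caffe_preprocess(img_rgb):
    """img_rgb: H x W x 3 uint8 array in RGB order, pixel values 0-255."""
    img = img_rgb.astype(np.float32)
    img_bgr = img[..., ::-1]              # swap channel order to BGR
    return img_bgr - IMAGENET_MEAN_BGR    # mean subtraction only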

@tsail good question. When we do one pass over the data we learn which direction to adjust each weight in (up or down), based on the data we have seen and the labels we are trying to tune the network to recognise. The problem is that we don't know by how much to adjust the weights. The learning rate controls the size of that adjustment: the update applied to each weight is the gradient multiplied by the learning rate. A small learning rate makes smaller adjustments to the weights and needs more passes over the data (epochs) to reach an optimal point. The caveat is that the learner can get trapped at various points along the way, but let's not discuss that now as it could lead to confusion. A large learning rate adjusts the weights more aggressively.

The next question is: why don't we just use large learning rates? With a large learning rate we can overshoot the optimal point we are trying to narrow in on, and because the steps are large we end up bouncing back and forth without ever converging on it. So it is common to start with a large learning rate and then gradually decrease it. However, this is just one method; there are many methods (automatic and manual) for adjusting the learning rate.
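To make that concrete, here's a toy sketch of a single gradient-descent update (not fastai's actual training loop; the function name and numbers are made up):

import numpy as np

# The learning rate scales the size of each weight update; it does not
# multiply the weights themselves.
def sgd_step(weights, gradients, lr):
    return weights - lr * gradients

w = np.array([0.5, -1.2])
g = np.array([0.1, -0.3])          # gradient from one mini-batch
print(sgd_step(w, g, lr=0.01))     # small lr -> small, cautious updates
print(sgd_step(w, g, lr=1.0))      # large lr -> big steps, risk of overshooting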

4 Likes

Google Colab

Hi, Did you fix this error?

Hi, No.

I am working on Colab now, uploading the data to my Google Drive and working from there.

Ah ok. AWS is my platform and I have my data in S3. I am getting the same error - KeyError: ‘content-length’.
Thanks.

I did not check, but I am pretty confident that this is PyTorch.

Thanks for your explanation @maral!

I encountered the same issue with a different Kaggle dataset. Did you fix this?

Has anyone tried using mnist_stats as declared in fastai/vision/data.py?

When I try data.normalize(mnist_stats) I get the error that mnist_stats is not defined. I proceeded by declaring it in my notebook, but maybe data.py needs to be updated (the __all__ part)?
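In case it helps, this is roughly what declaring it in the notebook looks like. The mean/std below are the commonly quoted single-channel MNIST stats, not copied from fastai's source, so please check them against fastai/vision/data.py:

# Workaround until mnist_stats is exported: declare the stats locally and pass
# them to normalize(). Values are the widely used MNIST mean/std (~0.131/~0.308);
# verify against fastai/vision/data.py before relying on them.
mnist_stats = ([0.131], [0.308])
data = data.normalize(mnist_stats)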

Will be in the next release.

1 Like

Wondering what the relationship is between learning rate and batch size. At the end of the first lecture, I needed to decrease the batch size due to a memory issue.

1cycle policy (beginner-level) questions:

In Lesson 1, Jeremy said we’d be using the 1cycle policy for scheduling learning rates. I wanted to know more about this method, so I read @sgugger’s excellent tutorial about this, but I got confused about two things:

  1. Terminology: I know what an “epoch” is, but I found “iterations” as used in the tutorial to be confusing, so I want to check: Is an iteration just one loss (and perhaps backpropagation) calculation for one mini-batch? ( Leslie Smith’s paper seemed to use the two words interchangeably in some places, and in other places the terms clearly are not equivalent. )

This would seem to agree with the definition in this post:

“Iterations is the number of batches needed to complete one epoch”

So…just checking: is this right? (I’ve put a small sketch of what I mean at the end of this post.) Thanks.

  2. My second question is regarding where Sylvain says…

“Then, the length of this cycle should be slightly less than the total number of epochs”

…but how do you know what the total number of epochs should be, until you actually do the training and monitor the validation loss (i.e. looking for when it starts to flatten out)? Or does he really mean “total number of iterations per epoch”?

(I’m aware that fastai v1 automates this policy so that we can simply “use it”, but I hope to understand what it’s doing.)
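Here is the small sketch mentioned above, spelling out my assumed definitions with made-up numbers:

import math

# My assumed definitions (please correct me if wrong):
#   iteration        = one mini-batch, i.e. one forward/backward pass + update
#   iterations/epoch = number of batches needed to see the whole dataset once
dataset_size = 50000
batch_size   = 64
num_epochs   = 4

iters_per_epoch = math.ceil(dataset_size / batch_size)   # 782
total_iters     = iters_per_epoch * num_epochs           # 3128
print(iters_per_epoch, total_iters)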

1 Like

Since the issue of looking at the lowest loss vs the strongest (negative) slope confuses many people, maybe the LR finder could plot the derivative of the loss with respect to the learning rate. Then one could tell people just to select its minimum as the ideal learning rate. Some people are not comfortable with the notion of slope per se.
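Something like this rough numpy sketch is what I have in mind (the lrs/losses arrays are assumed to come from the LR finder's recorder; the smoothing and the function name are made up):

import numpy as np

# Take the slope of the (lightly smoothed) loss with respect to log(lr) and
# pick the lr where that slope is most negative, i.e. the minimum of the derivative.
def suggest_lr(lrs, losses, smooth=5):
    losses = np.convolve(losses, np.ones(smooth) / smooth, mode='same')  # light smoothing
    slopes = np.gradient(losses, np.log(lrs))                            # d(loss)/d(log lr)
    return lrs[np.argmin(slopes)]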

In that case, I would pick something a bit less than 10^-5. Nonetheless, I don’t like that plot. Try with different batch sizes.

Yeah, I’ve tried that, but it didn’t work all that well. We still don’t have a hard and fast rule - if someone can come up with something that works every time, that would be great!

3 Likes

The most confusing thing happens when you see a curve that declines, then goes flat with slight ups and downs for many iterations, then comes down a bit (not as low as before) before finally rocketing up…

So in that case, should one choose the rate from the declining part, or the rate at which the curve was flat?

1 Like