# Lesson 1 Discussion ✅

(Francisco Ingham) #1163

(Harold) #1164

In the lesson 1 notebook, when we plot the confusion matrix:

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
```

Is it using a hold-out set to do this? Is this set different from the set used to train the model?

Thanks!

(Harold) #1165

Which platform are you using?

(Jennifer Liu) #1166

Thanks! I kept looking for the method in the learn object - didn’t expect it to be a part of the img!

(Satish Kottapalli) #1167

For ResNet, the typical normalization is mean subtraction; no division by std. The mean for the ImageNet dataset is

```python
mean = [103.939, 116.779, 123.68]
```

(BGR channels)
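As a minimal sketch of what mean subtraction in BGR order looks like, assuming the input is an RGB `uint8` array of shape `(H, W, 3)`: the helper name `subtract_mean_bgr` and the demo image are hypothetical and are not part of fastai or any other library mentioned in this thread.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as quoted in the post above.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def subtract_mean_bgr(image_rgb):
    """Hypothetical helper: reorder an RGB image to BGR, then subtract the means.

    No division by the standard deviation is performed.
    """
    # Reverse the channel axis (RGB -> BGR) and switch to float for the subtraction.
    image_bgr = image_rgb[..., ::-1].astype(np.float32)
    # The (3,) mean vector broadcasts over height and width.
    return image_bgr - IMAGENET_MEAN_BGR

# Demo: a 2x2 image where every pixel is pure red in RGB.
demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[..., 0] = 255
normalized = subtract_mean_bgr(demo)
```

After the channel reversal, red sits in the last channel, so only that channel ends up positive for this demo image.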

#1168

@tsail good question. When we do one pass, we learn which direction to adjust the weights in (up or down) based on the data we have seen and the labels we are trying to tune the network to recognise. The problem is that we do not know by how much to adjust the weights. The learning rate sets the size of the adjustment (i.e. we multiply the gradient by the learning rate before subtracting it from the weights). A small learning rate makes smaller adjustments to the weights and needs more iterations over the data (epochs) to get to an optimal point. The caveat is that the learner can get trapped at various points, but let’s not discuss that now as it could lead to confusion. A large learning rate will adjust the weights more aggressively. The next question is: why don’t we just use large learning rates? If we use a large learning rate, we can overshoot the optimal point we are trying to narrow in on, and because the steps are large we end up bouncing back and forth without ever settling on it. So it is common to start with a large learning rate and then gradually decrease it. However, this is just one method. There are many methods (automatic and manual) to adjust the learning rate.
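The effect of the step size can be seen on a toy problem. The sketch below minimises `f(w) = w**2` with plain gradient descent, where each update is `w = w - lr * gradient`. The function, the starting point and the learning rates are illustrative choices, not anything taken from the lesson notebook.

```python
def gradient_descent(lr, steps=20, w=5.0):
    """Minimise f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    history = [w]
    for _ in range(steps):
        grad = 2 * w          # direction and steepness at the current point
        w = w - lr * grad     # step size is the gradient scaled by the learning rate
        history.append(w)
    return history

small = gradient_descent(lr=0.01)     # creeps slowly towards the optimum at 0
good = gradient_descent(lr=0.3)       # converges quickly
large = gradient_descent(lr=1.0)      # bounces between 5 and -5 forever
too_large = gradient_descent(lr=1.1)  # overshoots further on every step
```

With `lr=0.01` the weight is still far from 0 after 20 steps, with `lr=1.0` it jumps from one side of the optimum to the other without getting closer, and with `lr=1.1` it moves away from the optimum.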

(Daniel) #1169

(Amulya) #1170

Hi, Did you fix this error?

(Bhuvana Kundumani) #1171

Hi, No.

I am working on Colab now; I am uploading the data to my Google Drive and working on it there.

(Amulya) #1173

Ah ok. AWS is my platform and I have my data in S3. I am getting the same error: `KeyError: 'content-length'`.
Thanks.

#1174

I did not check, but I am pretty confident that this is pytorch.

(Larry) #1175

(Amulya) #1176

I encountered the same issue with a different Kaggle dataset. Did you fix this?

(Akshay) #1177

Has anyone tried using `mnist_stats`, declared in `fastai/vision/data.py`?

When I try `data.normalize(mnist_stats)` I get an error saying `mnist_stats` is not defined. I proceeded by declaring it in my notebook, but maybe `data.py` needs to be updated (the `__all__` part)?

Will be in the next release.

#1179

I am wondering what the relationship is between learning rate and batch size. At the end of the first lecture, I needed to decrease the batch size due to a memory issue.

(Scott H Hawley) #1180

1cycle policy (beginner-level) questions:

1. Terminology: I know what an “epoch” is, but I found “iterations” used in the tutorial to be confusing, so I want to check: Is an iteration just one loss (and perhaps backpropagation) calculation for one mini-batch? (Leslie Smith’s paper seemed to use the two words interchangeably in some places, and in other places the terms are clearly not equivalent.)

This would seem to agree with the definition in this post:

“Iterations is the number of batches needed to complete one epoch”

So…just checking: is this right? Thanks.

2. My second question is regarding where Sylvain says…

“Then, the length of this cycle should be slightly less than the total number of epochs”

…but how do you know what the total number of epochs should be, until you actually do the training and monitor the validation loss (i.e. looking for when it starts to flatten out)? Or does he really mean “total number of iterations per epoch”?

(I’m aware that fastai v1 automates this policy so that we can simply “use it”, but I hope to understand what it’s doing.)
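For reference, here is the arithmetic under the definition quoted in the first question, where one iteration processes one mini-batch. This is a sketch with made-up numbers: the dataset size, batch size and epoch count are hypothetical, and it assumes the last partial batch is kept (some data loaders drop it, which would give one iteration fewer per epoch).

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

num_samples = 1000   # hypothetical training set size
batch_size = 64      # hypothetical batch size
num_epochs = 4       # hypothetical number of epochs

per_epoch = iterations_per_epoch(num_samples, batch_size)  # 1000 / 64 = 15.625, rounded up
total_iterations = per_epoch * num_epochs
```

Here `per_epoch` is 16 and `total_iterations` is 64, so a schedule defined over the whole run would span 64 iterations rather than 4.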

(Andrea de Luca) #1181

Since the issue of looking at the lowest loss vs. the strongest (negative) slope confuses many people, maybe the LR finder could plot the derivative of the loss with respect to the LR. Then one could tell people just to select the minimum of that derivative as the ideal learning rate. Some people are not comfortable with the notion of slope per se.
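A rough sketch of that idea, using NumPy on a synthetic loss curve: the curve below is made up for illustration and is not output from fastai's LR finder, and the derivative is taken with respect to `log10(lr)` because the finder's x-axis is logarithmic. Real curves are noisy and would likely need smoothing before differentiating.

```python
import numpy as np

# Synthetic LR-finder-style data: learning rates from 1e-6 to 1 on a log scale.
lrs = np.logspace(-6, 0, 61)
log_lrs = np.log10(lrs)

# Made-up loss: flat at first, then a drop centred near lr = 1e-3, then a blow-up.
losses = 1.0 - 0.8 / (1 + np.exp(-3 * (log_lrs + 3))) + np.exp(4 * (log_lrs + 0.5))

# Numerical derivative of the loss with respect to log10(lr).
slopes = np.gradient(losses, log_lrs)

steepest_lr = lrs[np.argmin(slopes)]     # most negative slope
lowest_loss_lr = lrs[np.argmin(losses)]  # bottom of the curve
```

On this synthetic curve the steepest descent sits near `1e-3`, well before the learning rate at which the loss bottoms out, which is the distinction the suggested plot would make visible.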

(Andrea de Luca) #1182

In that case, I would pick something a bit less than 10^-5. Nonetheless, I don’t like that plot. Try with different batch sizes.