# Lesson 1 Discussion ✅

(jaideep v) #1184

The most confusing thing happens when you see a curve declining, then going flat with slight ups and downs for many iterations, then coming down a bit (though not as low as before) before finally rocketing up…

So in that case, should one choose the learning rate from the declining part, or the rate at which the curve was flat?

(jaideep v) #1185

If someone can answer this… I run into it quite often.

(Nathan Hubens) #1186

Hi, can someone tell me why we don’t use `lr_find` before unfreezing the layers? If I have understood well, `fit_one_cycle` uses by default `max_lr=0.003` and `min_lr=0.003/25`.

But when looking at the LR curve before training the model at all, it seems that we could use higher learning rates to train the last few layers, so why isn’t that done in Lesson 1?

(Ramesh Kumar Singh) #1187

@NathanHub, it’s true… you can use `lr_find` before unfreezing, but in the notebook Jeremy wanted to show that the default LR is not great and how a much better LR can be found using `lr_find`. The LR is the rate at which we travel towards the minimum. Until we see how the loss behaves, we won’t know the maximum LR we can use. Before training it’s just a guess, and Jeremy has observed that 3e-3 is well suited in most cases. I hope this helps.
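As a toy illustration (my own sketch, not from the lesson): minimising f(x) = x² with plain gradient descent shows why a too-large learning rate overshoots and diverges while a moderate one converges:

```python
def gradient_descent(lr, steps=50, x0=5.0):
    """Minimise f(x) = x^2 (gradient is 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # one gradient step
    return x

small = gradient_descent(lr=0.1)  # step shrinks x towards the minimum at 0
big = gradient_descent(lr=1.1)    # each step overshoots: |x| grows without bound
```

With `lr=0.1` every step multiplies `x` by 0.8, so it decays to (nearly) zero; with `lr=1.1` each step multiplies it by -1.2, so it oscillates with growing magnitude, which is exactly the "fails to converge" behaviour discussed below.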

(danny iskandar) #1188

OK, let me try to answer; I might be wrong. The learning rate is how big a step you take downhill towards the global minimum in order to reach high accuracy or a low error rate. The bigger the step, the faster you tend to find the right weights, and hence your DL model is good to go; but the downside is that the optimizer may keep overshooting the global minimum, and then the model fails to converge.

So to summarize: how fast or slow your model converges, and whether it converges at all, is what choosing the learning rate is all about.

Now, batch size, from what I understand, is a way to run the DL computation economically. DL computation is expensive: you need a GPU, and so on. Instead of updating the parameters (weights and biases) every time a single data point goes through the network, you kind of ‘cheat’ by updating the parameters per batch. So if you choose batch size = 128 in an image classification model, you run 128 images through the network but you don’t update the parameters on every single image, only once per batch of 128 (technically, the gradients are averaged).

This needs memory, because the computer must hold the computations for all 128 images (the forward pass) before it can compute the averaged update in one go. So batch size also affects time and cost, not just memory: increasing the batch size saves time and cost, but you need more memory, and your model might not be as good (because your batch size is too big!).

To summarize, I don’t think there is any direct relationship between the two, but both hyperparameters affect the cost and time of training the model.

This is, to me, very intuitive and experiential. You kind of have to play with these hyperparameters to get a feeling for how low or how high to set them. Part of the job of a DL engineer is to experiment with them and build intuition.
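The per-batch update described above can be sketched in plain Python (a toy linear model, not fastai’s implementation):

```python
def minibatch_sgd(xs, ys, batch_size, lr=0.01, w=0.0):
    """Fit y = w*x by averaging the gradient of the squared error
    over each mini-batch, then applying ONE update per batch."""
    for start in range(0, len(xs), batch_size):
        batch = list(zip(xs[start:start + batch_size],
                         ys[start:start + batch_size]))
        # average gradient of 0.5*(w*x - y)^2 over the batch
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad  # one parameter update per batch, not per example
    return w

xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x for x in xs]  # true slope is 3
w = 0.0
for _ in range(20):  # a few passes (epochs) over the data
    w = minibatch_sgd(xs, ys, batch_size=4, w=w)
# w approaches the true slope 3.0
```

The point is only the shape of the loop: forward-pass quantities for the whole batch are held before a single averaged update is applied, which is why larger batches need more memory but fewer updates.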

(George Zhang) #1189

Ah, right! Now I think about it, it makes total sense. Thank you sir for correcting the mistake!

#1190

Thanks @diskandar, I had/have a similar understanding. One of the reasons for asking was this paper https://arxiv.org/abs/1711.00489, which argues: don’t decay the learning rate, increase the batch size. So there is a relationship between them, which wasn’t really clear to me.

(Asutosh) #1191

You mean differential learning rate?

(Ramesh Kumar Singh) #1192

I have some black corners (caused by the transformations) in my training dataset images. Will this affect the predictions? (Jeremy mentioned in the first class to look out for black borders or out-of-the-ordinary content in the images.) If yes, how do I handle it? Any help is appreciated. Thanks.

#1193

What used to be called “differential learning rate” (in last year’s part 1 course) was renamed “discriminative learning rate” for last year’s part 2 course:

‘And he [Sebastian Ruder] had a few good ideas as we went along and so you should totally make sure you read the paper. He said “well, this thing that you called in the lessons differential learning rates, differential kind of means something else. Maybe we should rename it” so we renamed it. It’s now called discriminative learning rate.’

Source.
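A sketch of the idea behind discriminative learning rates (my own illustration, not fastai’s code): assign each layer group its own rate, spread from a small value for the earliest layers to a larger one for the head. Whether fastai spreads the rates linearly or geometrically between the two ends of `slice(lo, hi)` is an implementation detail; this sketch uses geometric spacing:

```python
def discriminative_lrs(lo, hi, n_groups):
    """Return one learning rate per layer group, geometrically spaced
    from lo (earliest layers, smallest updates) to hi (head, largest)."""
    if n_groups == 1:
        return [hi]
    ratio = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * ratio ** i for i in range(n_groups)]

lrs = discriminative_lrs(1e-5, 1e-3, 3)
# earliest group trains slowest, the head trains fastest
```

The early, general-purpose layers change little, while the task-specific head moves the most, which is what `fit_one_cycle(..., max_lr=slice(1e-5, 1e-3))` is expressing.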

(A Santosh Kumar) #1194

I was just wondering why there is no mention of iterations in this graph.

The loss shown on the y-axis must be at some iteration count, right?

(Asutosh) #1196

Oh, I was unaware of this fact. I had only gone through Part 1. Thanks for mentioning it.

(Asutosh) #1197

Mislabelling of data is basically a manual human error. I don’t think it’s possible to automatically identify mislabelled data. How would a computer know that something has been mislabelled? Only by cross-checking against the model’s output, I guess. So there needs to be some ground truth on which a decision can be based. Just as you said, mislabelled data are more likely to be misclassified. So that’s what we do: we look at the misclassified data and use our domain knowledge to decide whether each example is mislabelled or not.
The new Jupyter widget (used in Lesson 2) that deletes files can be used to delete mislabelled data, which potentially hurts model performance.

(Francisco Ingham) #1198

The model can only suggest which images are very confusing to classify, since they don’t look like the other examples it has seen for that category. But it cannot decide whether an image is mislabelled, since it doesn’t know the ground truth, just what the image looks like. In other words, how can the model decide between a mislabelled ragdoll cat which is actually a birman, and a ragdoll that just looks very much like a birman? It will probably not be very confident about either.

(Santhosh) #1199

The loss printed is that of the predicted class, and the code that @joshfp posted confirms the same.

(Asutosh) #1200

This is for the validation set.

So, yes.

And yes.

(Jose Roman) #1201

Looking at the MNIST example at the end of the lesson notebook, I saw that the unzipped version of the file had 3 items in it: labels.csv, a valid folder, and a train folder.

That being said, when creating the `ImageDataBunch` like this:

```python
data = ImageDataBunch.from_csv(path, ds_tfms=tfms, size=28)
```

How does the default value of `valid_pct=0.2` come into the picture? Will the `ImageDataBunch` just have a validation set of 20% of what was in the valid folder, or does the function look at all the images in both the valid and train folders to make the validation set?
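I can’t say for certain how `from_csv` combines the two folders, but the `valid_pct` mechanism itself is just a random split of the labelled items. A toy sketch of a 20% hold-out (my own illustration, not fastai’s internals):

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle the items and hold out valid_pct of them for validation."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * valid_pct)
    return shuffled[cut:], shuffled[:cut]  # (train, valid)

files = [f"img_{i}.png" for i in range(100)]
train, valid = random_split(files)
# 80 files end up in train, 20 in valid, with no overlap
```

Fixing the seed makes the split reproducible across runs, which matters when you want validation metrics to be comparable between experiments.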

Hi, I am trying to recreate Lesson 1 with my own data. I am using `ImageDataBunch.from_csv` and I am receiving an error:

```
AttributeError                    Traceback (most recent call last)
<ipython-input-81-670ca4990c14> in <module>()
----> 1 data = ImageDataBunch.from_csv(path, folder='bee_imgs', suffix='.png')

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
```

(Yash Mittal) #1203

I got the following error when using `fbeta` as a metric:
`RuntimeError: The size of tensor a (43) must match the size of tensor b (128) at non-singleton dimension 1`

When I tried `accuracy` as the metric, it worked fine. Tensor b is double the size of the batch: with `bs=64` I got 128, and with `bs=32` I got 64. I am classifying images into 43 categories.
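If I recall correctly, fastai’s `fbeta` metric is written for multi-label problems, so it expects the targets to be one-hot rows with the same `(batch, n_classes)` shape as the predictions; with single-label class indices the shapes disagree. A plain-Python sketch of that shape expectation (my own illustration, not the fastai code):

```python
def one_hot(targets, n_classes):
    """Turn a list of class indices into one-hot rows, so the targets
    have the same (batch, n_classes) shape as the prediction matrix."""
    return [[1.0 if c == t else 0.0 for c in range(n_classes)]
            for t in targets]

encoded = one_hot([2, 0, 1], n_classes=4)
# each row now has length 4, matching a (batch, 4) prediction matrix
```

For a single-label problem the usual fix is to use a metric designed for class indices (like `accuracy`) rather than forcing one-hot targets, but the sketch shows why the dimensions clash.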

(Yash Mittal) #1204