@safekidda My thinking is that the training was initially done on 128-sized images, and later, when the size is increased and the weights from the previous model are reused, the new model can learn faster by building on top of the previous one. But I would like other, more experienced people to comment and clear up the doubts.
@joshfp I’d point out that the new channels would likely retain some of their spatial information, so at least some of the weights of the early layers will transfer.
Thanks. Yes, it used weights from the previous model, but I’m questioning how they would be useful given that the dimensions of the image have changed. If you think about it, the filters that were learned worked on small satellite images, so if everything suddenly got 4 times bigger, how good would, say, an edge detector be? The only way I can see it working is if the original model had augmentation applied so it could work with zoomed images.
When we first set up the model training, for example:
learn = create_cnn(data, models.resnet50, metrics=[error_rate, accuracy])
and then call the fit_one_cycle
learn.fit_one_cycle(5)
what is the learning rate used by the model? Is there a default learning rate somewhere?
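If it helps, one way to see what your own version actually uses is to inspect the signature (a minimal sketch, assuming the fastai v1 API where fit_one_cycle takes a max_lr argument that falls back to the library default when omitted; the 3e-3 below is just an illustrative value):
import inspect
print(inspect.signature(learn.fit_one_cycle))  # shows the default max_lr for your fastai version
learn.fit_one_cycle(5, max_lr=3e-3)            # passing it explicitly makes the value visible in your code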
In lesson 3, progressive resizing is talked about a lot. At around 2:05:05 in the video, Jeremy says that we have trained with some size, 128x128, and now we can take the weights of this model and put them in a new learner created with 256x256 images.
Here is my question:
Isn’t the number of weights in the model dependent on the input size? If it is, how can the weights of a 128 model fit a 256 model?
The loss surface stays the same, but the optimal LR(s) does change over time (iterations), since the finder cannot see the whole surface, just the small patch of it that it can see by running a brief fake training on the minibatch(es) it employs (and this is why it does not perform well if your bs is too small: it cannot even grab a decent local view). As you train, you move across the surface; that is the whole point of training. And as you move, the optimal LR changes.
Pay particular attention as it talks about the averaged plot vs. the raw plot. That is one of the reasons why you should never take the lowest loss (another is intrinsic, topological: you want to stay well away from the blow-up).
So, as a general rule of thumb: try to pick the point of maximum negative slope, as long as that point is reasonably far from the blow-up and the loss still decreases nicely around it.
Do tend to prefer higher LRs if possible: since they act as a regularization method by themselves, they’ll help you avoid overfitting (see the relevant papers by Leslie Smith about super-convergence).
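A rough sketch of that workflow (assuming fastai v1’s lr_find and recorder.plot; the 1e-3 is just an illustrative value read off the plot):
learn.lr_find()                       # brief "fake" training over a range of LRs on a few minibatches
learn.recorder.plot()                 # smoothed loss vs. LR: look for the steepest downward slope, well before the blow-up
learn.fit_one_cycle(5, max_lr=1e-3)   # illustrative value chosen from the plot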
For the convolutional layers, the number of weights depends on the filter size, so as long as you don’t change the filter size, the number of weights will stay the same. The input size only affects the number of activations of the convolutional layer.
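You can check this in plain PyTorch (just a sketch to illustrate the point): the weight count of a conv layer is fixed by its filter size and channel counts, while the activation shape tracks the input size.
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 3*3*3*64 + 64 = 1792 weights, regardless of image size
print(conv(torch.randn(1, 3, 128, 128)).shape)     # torch.Size([1, 64, 128, 128])
print(conv(torch.randn(1, 3, 256, 256)).shape)     # torch.Size([1, 64, 256, 256])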
Then, so that the different number of activations coming out of the convolutional layers doesn’t affect the linear layers, a neat trick is used, called an Adaptive Pooling Layer. These pooling layers are similar to standard pooling layers (max or average pool), but they convert any input size to the specified target size (that’s why they’re called adaptive). In this way, the number of inputs to the linear layers is always the same no matter the image’s input size. You can check the adaptive pooling layers by running learn.model.
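For instance (again plain PyTorch, channel count and spatial sizes are purely illustrative): an adaptive pooling layer squashes whatever spatial size it receives down to the target size, so the linear head always sees the same number of inputs.
import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)                  # target output size 1x1, whatever the input
print(pool(torch.randn(1, 512, 4, 4)).shape)    # torch.Size([1, 512, 1, 1])  (e.g. feature maps from 128px images)
print(pool(torch.randn(1, 512, 8, 8)).shape)    # torch.Size([1, 512, 1, 1])  (e.g. feature maps from 256px images)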