Relationship between cycle_mult and learning rate

(K Doherty) #1

Hi, I’ve been working on building an image classifier on a personal dataset with the resnext101 architecture. The generic recipe prescribed in the first set of lessons advises:
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)

I’ve found this to work fairly well when training frozen models, but unfrozen training with the same parameters overfits quite quickly, usually by the 3rd epoch. Would decreasing the learning rate help mitigate this? Finding out by trial and error is slow, though I am working through it at the moment.
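For reference, here is a rough sketch of what that call schedules. It assumes the SGDR-style behaviour in which cycle i lasts cycle_len * cycle_mult**i epochs and the learning rate is cosine-annealed from lr towards zero within each cycle before restarting. The helper names cycle_lengths and cosine_lr and the value of lr are made up for illustration and are not part of the library, which (as far as I know) anneals per mini-batch rather than per epoch:

```python
import math


def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=1):
    """Length in epochs of each restart cycle: cycle_len * cycle_mult**i."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]


def cosine_lr(lr_max, epoch_in_cycle, length):
    """Cosine-annealed learning rate at a (possibly fractional) epoch within one cycle."""
    return lr_max / 2 * (1 + math.cos(math.pi * epoch_in_cycle / length))


# Settings from the call above: 3 cycles, cycle_len=1, cycle_mult=2.
lengths = cycle_lengths(3, cycle_len=1, cycle_mult=2)
total_epochs = sum(lengths)
print(lengths, total_epochs)  # [1, 2, 4] 7

lr = 1e-2  # hypothetical peak learning rate, chosen only for illustration
for i, length in enumerate(lengths):
    starts = [round(cosine_lr(lr, e, length), 5) for e in range(length)]
    print(f"cycle {i}: lr at the start of each epoch = {starts}")
```

With these settings the three cycles last 1, 2 and 4 epochs, 7 epochs in total, so the 3rd epoch is the second half of the second cycle.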


Resnext101 is really big - you might want to start with something smaller, like resnet34 or resnet50. Even if those archs will not be the ones you use ultimately, you will start getting a feel for the problem and whether a larger arch might be needed, etc.

In general, when you have an overfitting problem, you might want to increase and not decrease the learning rate.

I suspect you might be training on a really small dataset - in such a case you might not want to unfreeze all the layers. It might be that training just the last segment, the classifier, gives the best results. Or maybe unfreezing the middle section as well would help.

All of this is very situation dependent.

(K Doherty) #3

Hi, thanks for getting back to me. It is an interesting question which architectures suit which dataset sizes. I have been experimenting with a dataset of 24320 images in 76 equally balanced classes, 20% of which I have been using for validation. I have also retained a separate test set of 6080 images, which I haven’t even tried predicting on yet.
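To put those numbers in per-class terms, here is a quick back-of-the-envelope calculation. It assumes the 20% validation split is stratified, i.e. taken evenly from every class, which is an assumption rather than something I have verified; the variable names are just for illustration:

```python
n_images = 24320     # training + validation images
n_classes = 76       # equally balanced classes
val_fraction = 0.20  # share held out for validation
n_test = 6080        # separate test set

n_val = round(n_images * val_fraction)
n_train = n_images - n_val

print("train / val images:", n_train, n_val)
print("images per class overall:", n_images // n_classes)
# Per-class training count assumes a stratified split.
print("training images per class:", n_train // n_classes)
print("test images per class:", n_test // n_classes)
```

That works out to 320 images per class, of which 256 per class are left for training under that assumption.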

I have tried retraining all of the architectures you’ve mentioned on my data, and found the ResNext101_64 to give roughly 5% higher accuracy than the other two, though you have to manage the training process closely to ensure it does not overfit. It also takes far longer to train.

What advantages would you ascribe to the resnet34 and resnet50 architectures? Also, would you be able to share the distinguishing characteristics of resnet architectures vs. resnext?


The reason I suggested a smaller arch was that you mentioned overfitting - I assumed you might be working with a small dataset. Larger archs are easier to overfit with.

Resnexts seem to train differently from resnets, but it is hard to quantify. For me they also seemed to train better than resnets at higher lrs, which might be useful for avoiding overfitting.