I just wanted to see if I am understanding this correctly. The way I understand it, this is basically building an ensemble model by training a single model for different amounts of time. So for VGG16, you might use VGG16 @ 5 epochs, 10 epochs, and 20 epochs, then use those three models together to get the overall answer. If I am way off the mark, can somebody give me a paper that explains it? Here is the paper I have been consuming: https://arxiv.org/pdf/1704.00109.pdf
Ok, after looking into this more, I think I was a little off with my initial assumption. My new understanding is that it uses learning rate manipulation to come up with the different models. So a higher learning rate is used first, then a really small one to settle into a local minimum (that would be one cycle, I think). This is then repeated however many times you can/want in order to get an ensemble without having to train each model separately from scratch.
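The cycle described above can be sketched as a cosine-annealed learning rate with restarts. This is just an illustration of the schedule, not code from the paper; `lr_max` and the cycle count are made-up values.

```python
import math

def snapshot_lr(iteration, total_iterations, n_cycles, lr_max=0.1):
    """Cosine-annealed learning rate with warm restarts.

    Within each cycle the rate decays from lr_max to near zero,
    then jumps back to lr_max at the start of the next cycle.
    """
    cycle_len = total_iterations // n_cycles   # iterations per cycle
    t = iteration % cycle_len                  # position within the current cycle
    return (lr_max / 2) * (math.cos(math.pi * t / cycle_len) + 1)

# Highest at the start of each cycle, near zero at the end, then it restarts:
print(snapshot_lr(0, 1000, 5))    # start of cycle 1 -> 0.1
print(snapshot_lr(199, 1000, 5))  # end of cycle 1 -> close to 0
print(snapshot_lr(200, 1000, 5))  # restart -> back to 0.1
```

Each of those sudden jumps back to `lr_max` is what kicks the model out of the minimum it just settled into.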
Yes exactly, the technique is an extension of SGD with Warm Restarts (SGDR), first proposed in this paper:
https://arxiv.org/abs/1608.03983
In my experiments it works really well. And makes cool charts.
Snapshot learning basically means you save the model weights every time the learning rate restarts, and ensemble this group of "mini models" to make a final prediction.
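That save-on-restart logic can be sketched with a toy loop. Everything here is illustrative: `weights` stands in for real model parameters, and the "training step" is a dummy update; the point is detecting the restart (the learning rate jumping back up) and copying the weights at that moment.

```python
import math

def cosine_lr(t, cycle_len, lr_max=0.1):
    """One cosine-annealing cycle, repeating every cycle_len iterations."""
    return (lr_max / 2) * (math.cos(math.pi * (t % cycle_len) / cycle_len) + 1)

snapshots = []
weights = [0.0]                      # dummy stand-in for model parameters
cycle_len, total_iters = 100, 300    # 3 cycles -> 3 snapshots
prev_lr = cosine_lr(0, cycle_len)
for t in range(1, total_iters + 1):
    weights[0] += 0.01               # pretend training step
    lr = cosine_lr(t, cycle_len)
    if lr > prev_lr or t == total_iters:   # restart (or final step): snapshot
        snapshots.append(list(weights))    # save a copy, not a reference
    prev_lr = lr

print(len(snapshots))  # 3 snapshots, one per cycle
```

In a real framework you would save a checkpoint file at that point instead of appending to a list.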
So the steps to do this with VGG16 would be:

1. Create the VGG16 model.

2. Train a few epochs.

3. Reduce the learning rate significantly.

4. Train a lot of epochs to get it to settle into a minimum.

5. Save the weights.

6. Increase the learning rate back up and train a few epochs to get it out of that hole.

7. Repeat steps 3-6.

8. Once you have enough models built from VGG16, somehow give each of these a vote.

9. Determine the final result based on the vote of each individual model.
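The "vote" at the end is usually soft voting: average the softmax outputs of the snapshot models and take the argmax, which is what the Snapshot Ensembles paper does. A minimal sketch, with made-up probability vectors for illustration:

```python
def ensemble_predict(snapshot_probs):
    """Average one softmax vector per snapshot model, return the argmax class."""
    n = len(snapshot_probs)
    avg = [sum(p[i] for p in snapshot_probs) / n
           for i in range(len(snapshot_probs[0]))]
    return max(range(len(avg)), key=avg.__getitem__)  # predicted class index

preds = [
    [0.6, 0.3, 0.1],   # snapshot 1: favors class 0
    [0.2, 0.5, 0.3],   # snapshot 2: favors class 1
    [0.5, 0.4, 0.1],   # snapshot 3: favors class 0
]
print(ensemble_predict(preds))  # averaged probabilities favor class 0 -> 0
```

Averaging probabilities rather than hard class labels lets a snapshot that is only mildly confident still contribute proportionally.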
Is this as good as ensembling VGG16, ResNet, and Inception together? I would think it would still be worse than a real ensemble, because those models may be using different features to come to their conclusions, whereas with cyclic cosine annealing you are always looking at the same model, just with slightly different answers.
Yep, it's not as good, naturally, but it's way easier and can be used to improve all models. Even without the ensembling, the cyclical learning rate part is a big win on its own, typically converging faster and without hand tuning.
It's as simple as calling the SnapshotLR method in this file every mini-batch iteration. No need to do any manual adjusting.