Cyclic Cosine Annealing

I just wanted to see if I am understanding this correctly. The way I understand it, this is basically building an ensemble model by taking a single model and training it for different amounts of time. So for VGG16, you might use VGG16 @ 5 epochs, 10 epochs, and 20 epochs, then use those three models together to get the overall answer. If I am way off the mark, can somebody give me a paper that explains it? Here is the paper I have been consuming:

Ok, after looking into this more, I think I was a little off with my initial assumption. My new understanding is that it uses learning rate manipulation to come up with the different models. A higher learning rate is used at first, then a really small one to settle into a local minimum (this would be one cycle, I think). This is then repeated for however many cycles you can/want in order to get an ensemble without having to train each model separately from scratch.
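The cycle you describe is exactly what the cosine schedule implements: the learning rate starts high, decays along a half cosine toward (near) zero, then jumps back up at the restart. A minimal sketch, assuming a fixed cycle length (the SGDR paper also allows lengthening the cycles); the function name is made up for illustration:

```python
import math

def cosine_annealing_lr(iteration, cycle_length, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate for SGDR-style warm restarts.

    Within each cycle the LR decays from lr_max to lr_min along a
    half cosine; at the start of the next cycle it jumps back to lr_max.
    """
    t = iteration % cycle_length  # position within the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_length))
```

Plotting this over many iterations gives the repeating cosine "sawtooth" shape.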

Yes, exactly. The technique is an extension of SGD with Warm Restarts, first proposed in this paper:

In my experiments it works really well. And it makes cool charts.

Snapshot learning basically means you save the model weights every time the learning rate restarts, and ensemble this group of “mini models” to make a final prediction.
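One common way to ensemble those "mini models" is a soft vote: average the class probabilities each snapshot predicts, then take the highest-scoring class. A sketch, assuming NumPy and a `(n_snapshots, n_samples, n_classes)` probability array; the function name is my own:

```python
import numpy as np

def snapshot_ensemble_predict(snapshot_probs):
    """Soft-vote over snapshot models.

    snapshot_probs: array of shape (n_snapshots, n_samples, n_classes)
    holding each snapshot's predicted class probabilities.
    Returns the predicted class index for each sample.
    """
    avg = np.mean(snapshot_probs, axis=0)  # average over snapshots
    return avg.argmax(axis=1)              # highest-probability class
```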

So the steps to do this with VGG16 would be:

  1. Create VGG16 model

  2. Train a few epochs

  3. Reduce the learning rate significantly

  4. Train for many epochs so the model settles into a minimum

  5. Save the weights

  6. Increase the learning rate back up and train for a few epochs to climb out of that minimum

  7. Repeat steps 3-6

  8. Once you have enough snapshots of VGG16, give each of these a vote

  9. Determine final results based on the vote of each individual model.
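The steps above can be sketched as a loop. This is a toy walk-through, not real training: `weights` is a stand-in dict for actual model parameters, and the commented line is where a real mini-batch update would go:

```python
import copy
import math

def run_snapshot_cycles(n_cycles, iters_per_cycle, lr_max, weights):
    """Steps 3-7: anneal the LR over each cycle, save a snapshot of the
    weights at the cycle's end, then restart the next cycle at lr_max."""
    snapshots = []
    for cycle in range(n_cycles):
        for t in range(iters_per_cycle):
            # cosine decay from lr_max toward 0 within this cycle (steps 3-4)
            lr = 0.5 * lr_max * (1 + math.cos(math.pi * t / iters_per_cycle))
            # ... train one mini-batch at this lr ...
        snapshots.append(copy.deepcopy(weights))  # step 5: save the weights
        # the next outer iteration restarts at lr_max (step 6)
    return snapshots  # steps 8-9: ensemble these for the final vote
```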

Is this as good as ensembling VGG16, ResNet, and Inception together? I would think it would still be worse than a real ensemble, because those models may be using different features to come to their conclusions, whereas with cyclic cosine annealing you are always looking at the same model, just with slightly different answers.

Yep, it’s naturally not as good, but it’s way easier and can be used to improve any model. Even without the ensembling, the cyclical learning rate part is a big win, typically converging faster and without hand tuning.

It’s as simple as calling the SnapshotLR method in this file every mini-batch iteration. No need to do any manual adjusting.
