Cyclic Cosine Annealing

I just wanted to see if I am understanding this correctly. The way I understand it, this is basically building an ensemble model by using a single model and training it for different amounts of time. So for VGG16, you might use VGG16 @ 5 epochs, 10 epochs, and 20 epochs. Then you would use those three models together to get the overall answer. If I am way off the mark, can somebody give me a paper that explains it? Here is the paper I have been consuming: https://arxiv.org/pdf/1704.00109.pdf

Ok, after looking into this more, I think I was a little off with my initial assumption. My new understanding is that it uses learning rate manipulation to come up with the different models. So a higher learning rate is used, then a really small one is used to get to a local minimum (this would be one cycle, I think). This is then repeated however many times you can/want to in order to get an ensemble without having to train each member separately from scratch.
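To put that more concretely, the paper linked above anneals the learning rate with a shifted cosine that restarts at the top of every cycle. A minimal sketch of that schedule (the variable names here are mine, not from the paper's code):

```python
import math

def snapshot_lr(alpha0, t, T, M):
    """Learning rate at iteration t (1-indexed) under cyclic cosine annealing.

    alpha0: initial / maximum learning rate
    T:      total number of training iterations
    M:      number of cycles (one snapshot per cycle)
    """
    cycle_len = math.ceil(T / M)        # iterations per cycle
    t_cur = (t - 1) % cycle_len         # position inside the current cycle
    # Starts at alpha0, decays towards 0, then jumps back to alpha0 at the next cycle.
    return alpha0 / 2 * (math.cos(math.pi * t_cur / cycle_len) + 1)
```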

Yes exactly, the technique is an extension of SGD with Warm Restarts, first proposed in this paper:

https://arxiv.org/abs/1608.03983

In my experiments it works really well. And it makes cool charts.

Snapshot learning basically means you save the model weights every time the learning rate restarts, and ensemble this group of “mini models” to make a final prediction.
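To make the ensembling part concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code), assuming the snapshots are kept as a list of state_dicts saved at each restart: load them one at a time into the same architecture and average the softmax outputs.

```python
import torch
import torch.nn.functional as F

def snapshot_ensemble_predict(model, snapshots, x):
    """Average the softmax outputs over the saved weight snapshots (state_dicts)."""
    model.eval()
    probs = 0
    with torch.no_grad():
        for state_dict in snapshots:               # each snapshot = weights at one LR restart
            model.load_state_dict(state_dict)
            probs = probs + F.softmax(model(x), dim=1)
    return (probs / len(snapshots)).argmax(dim=1)  # mean probability -> final class prediction
```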

So the steps to do this with VGG16 would be (there is a rough code sketch after the list):

  1. Create VGG16 model

  2. Train a few epochs

  3. Reduce the learning rate significantly

  4. Train for many epochs to get it to settle into a minimum

  5. Save the weights (one "snapshot")

  6. Increase the learning rate back up and train a few epochs to get it out of that hole

  7. Repeat steps 3-6 until you have as many snapshots as you want

  8. Once you have enough snapshots of VGG16, give each of these a vote (in practice, average their predicted probabilities)

  9. Determine final results based on the vote of each individual model.
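As promised above, here is a rough PyTorch sketch of steps 1-7 (the loader, class count, epoch/cycle counts, and alpha0 are placeholders I chose; steps 8-9 are the `snapshot_ensemble_predict` averaging sketch earlier in the thread):

```python
import copy
import math

import torch
import torch.nn.functional as F
from torchvision.models import vgg16


def snapshot_lr(alpha0, t, T, M):
    """Cyclic cosine annealing (same formula as the earlier sketch)."""
    cycle_len = math.ceil(T / M)
    return alpha0 / 2 * (math.cos(math.pi * ((t - 1) % cycle_len) / cycle_len) + 1)


def train_snapshot_ensemble(train_loader, num_classes, epochs, cycles=4, alpha0=0.1, device="cpu"):
    model = vgg16(num_classes=num_classes).to(device)          # step 1: create VGG16
    optimizer = torch.optim.SGD(model.parameters(), lr=alpha0, momentum=0.9)

    total_iters = epochs * len(train_loader)
    cycle_len = math.ceil(total_iters / cycles)
    snapshots = []                                              # one state_dict per cycle

    t = 0
    for epoch in range(epochs):
        for x, y in train_loader:
            t += 1
            # Steps 2-4 and 6 all fall out of the schedule: the LR decays towards
            # zero inside a cycle, then jumps back to alpha0 at the next restart.
            for group in optimizer.param_groups:
                group["lr"] = snapshot_lr(alpha0, t, total_iters, cycles)

            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()

            # Step 5: save a snapshot just before the LR jumps back up.
            # Step 7 is simply letting the loop run for the remaining cycles.
            if t % cycle_len == 0:
                snapshots.append(copy.deepcopy(model.state_dict()))

    return model, snapshots   # pass `snapshots` to snapshot_ensemble_predict for steps 8-9
```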

Is this as good as ensembling VGG16, ResNet, and Inception together? I would think it would still be worse than a real ensemble because different architectures may be using different features to come to their conclusions, whereas with Cyclic Cosine Annealing you are always looking at the same model, just with slightly different answers.

Yep, naturally it's not as good, but it's way easier and can be used to improve any model. Even without the ensembling, the cyclical learning rate part is a big win, typically converging faster and without hand tuning.

It's as simple as calling the SnapshotLR method in this file every mini-batch iteration. No need to do any manual adjusting.
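For anyone on PyTorch: the linked file isn't shown here, but the framework ships a built-in scheduler for the SGDR-style schedule that you likewise step once per mini-batch. The function below and its arguments are just an illustrative sketch, not the SnapshotLR code being referenced.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def train_with_warm_restarts(model, train_loader, loss_fn, epochs, base_lr=0.1, cycle_epochs=10):
    """Step PyTorch's built-in SGDR scheduler once per mini-batch (fractional epochs)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=cycle_epochs, T_mult=1)
    iters = len(train_loader)
    for epoch in range(epochs):
        for i, (x, y) in enumerate(train_loader):
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            # Passing a fractional epoch makes the cosine decay advance every batch,
            # restarting (LR back to base_lr) every `cycle_epochs` epochs.
            scheduler.step(epoch + i / iters)
```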
