Does the learning rate finder work for Adam?

mefmef · November 17, 2017, 2:20pm

Does the learning rate finder technique work for more sophisticated optimizers like Adam?

What can I do to use the learning rate finder in my own PyTorch model?

radek · November 17, 2017, 2:34pm

The answer to the first question, unless I am completely wrong, is that it should work regardless. The problem is that Adam bases more of its update on information accumulated from earlier iterations than the SGD with momentum that we seem to be using here. Still, momentum also accumulates across iterations, and it seems to work…

There is no simple answer to the second question, apart from looking at the fastai lib and working out which parts to reuse, or implementing this from scratch.
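For the "from scratch" route, here is a minimal sketch of a learning rate range test in plain PyTorch. It is not the fastai implementation; the function name `lr_find`, its parameters, and the stopping rule (stop once the smoothed loss exceeds 4x the best seen so far) are assumptions made for illustration. It works with any `torch.optim` optimizer, Adam included, because it only rewrites the `lr` entry of each param group between batches.

```python
import math

import torch
from torch import nn


def lr_find(model, loss_fn, optimizer, batches, start_lr=1e-5, end_lr=10.0, beta=0.98):
    """Raise the learning rate exponentially, one step per batch, and record
    the smoothed loss at each learning rate. (Hypothetical helper.)"""
    batches = list(batches)
    # Multiplier that takes lr from start_lr to end_lr over the given batches.
    mult = (end_lr / start_lr) ** (1.0 / max(len(batches) - 1, 1))
    lr, avg_loss, best_loss = start_lr, 0.0, float("inf")
    lrs, losses = [], []
    for i, (xb, yb) in enumerate(batches, start=1):
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        # Exponentially weighted average of the loss, with bias correction.
        avg_loss = beta * avg_loss + (1 - beta) * loss.item()
        smoothed = avg_loss / (1 - beta ** i)
        # Assumed stopping rule: quit once the loss diverges.
        if not math.isfinite(smoothed) or smoothed > 4 * best_loss:
            break
        best_loss = min(best_loss, smoothed)
        lrs.append(lr)
        losses.append(smoothed)
        loss.backward()
        optimizer.step()
        lr *= mult
    return lrs, losses


# Toy usage on random data, with Adam as the optimizer.
torch.manual_seed(0)
model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters())
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(50)]
lrs, losses = lr_find(model, nn.MSELoss(), opt, data)
```

Plot `losses` against `lrs` on a log x-axis and pick a rate somewhat below the point where the loss starts to climb. The run modifies the model weights and the optimizer state, so save `model.state_dict()` and `optimizer.state_dict()` beforehand and restore them before real training.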

bushaev · November 17, 2017, 2:54pm

The fast.ai library actually uses Adam behind the scenes for optimization. Jeremy mentioned it in the second lecture. So yes, it will work.

miguel_perez · November 17, 2017, 3:27pm

Yes, I also remember Jeremy saying that Adam is used behind the scenes. My question about Adam is now… can we actually choose not to use Adam and use plain vanilla SGD instead? If yes, how?

bushaev · November 17, 2017, 3:42pm

Yes, if you are using ConvLearner.pretrained, you need to pass opt_fn=SGD. It is passed on to the initialization of ConvLearner, and from there to the initialization of Learner. I have not tried it, though, so it might not work.

bushaev · November 17, 2017, 3:43pm

It looks like the default is SGD_Momentum, or am I wrong? Here’s the code from learner.py:

    self.opt_fn = opt_fn or SGD_Momentum(0.9)

binga · November 17, 2017, 3:44pm

Are you sure this is the case? This line suggests the default optimization function is SGD_Momentum.

github.com

fastai/fastai/blob/master/fastai/learner.py#L28


        self.sched=None
        self.wd_sched = None
        self.clip = None
        self.opt_fn = opt_fn or SGD_Momentum(0.9)
        self.tmp_path = os.path.join(self.data.path, tmp_name)
        self.models_path = os.path.join(self.data.path, models_name)
        os.makedirs(self.tmp_path, exist_ok=True)
        os.makedirs(self.models_path, exist_ok=True)
        self.crit,self.reg_fn = None,None


    @classmethod
    def from_model_data(cls, m, data):
        self = cls(data, BasicModel(to_gpu(m)))
        self.unfreeze()
        return self


    def __getitem__(self,i): return self.children[i]


    @property
    def children(self): return children(self.model)

miguel_perez · November 17, 2017, 4:05pm

I remember Jeremy mentioning it; I haven’t analyzed the code. (I'm a beginner in Python, really limited.)

The comment about Adam being used is also in the lesson 2 lecture notes here (but it could be a mistake).

DeepLearning-LecNotes2

ramesh · November 17, 2017, 5:15pm

@mefmef -

Shameless plug - I used learning rate finder on CIFAR10 using a Custom PyTorch model -

jeremy · November 17, 2017, 5:16pm

Huh - guess I misremembered. I think at some point I had Adam as the default, but it looks like I switched to SGD with momentum. How embarrassing!

metachi · November 17, 2017, 5:20pm

I noticed that when I was writing lesson 1 in Keras. On the bright side, I think this is a great excuse to talk about Adam and other optimizers at some point

miguel_perez · November 17, 2017, 6:06pm

Challenge accepted!

I came across this while trying to better understand Adam. It looks like a serious flaw, not only of Adam but of adaptive learning rate methods in general. @jeremy, maybe it is related to your decision to use SGD as the default? https://arxiv.org/abs/1705.08292
(I just discovered Adam, so I have no clue how big this problem is in practice.)

jeremy · November 17, 2017, 6:23pm

Well spotted. However, I find these theoretical papers using synthetic datasets nearly always useless in practice. A paper from a few days ago shows we can use Adam to get SoTA results: [1711.05101] Decoupled Weight Decay Regularization. Ideally, I’d like to implement that before I make Adam the default, since without it, Adam can give slightly sub-optimal results.
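To make the difference concrete, here is a small sketch comparing L2-style weight decay in `torch.optim.Adam` with the decoupled version in `torch.optim.AdamW`, which was added to PyTorch in a later release and is documented as implementing the paper above. The helper `one_step` and the toy values are assumptions for illustration only. The loss gradient is set to zero so that the only thing moving the weights is the weight decay itself.

```python
import torch


def one_step(opt_cls, **kwargs):
    """Take one optimizer step with a zero loss gradient. (Hypothetical helper.)"""
    w = torch.nn.Parameter(torch.tensor([1.0, 100.0]))
    opt = opt_cls([w], lr=0.1, **kwargs)
    # Zero gradient from the loss: any change in w comes from weight decay.
    w.grad = torch.zeros_like(w)
    opt.step()
    return w.detach().clone()


# Adam folds weight_decay * w into the gradient, so the decay term is
# rescaled by Adam's adaptive normalization like any other gradient.
coupled = one_step(torch.optim.Adam, weight_decay=0.1)

# AdamW shrinks each weight by the factor (1 - lr * weight_decay) directly,
# independently of the adaptive gradient statistics.
decoupled = one_step(torch.optim.AdamW, weight_decay=0.1)

print("Adam  (L2 in gradient):", coupled)
print("AdamW (decoupled):     ", decoupled)
```

On this first step, Adam moves both weights by about the same absolute amount (roughly `lr`) regardless of their size, while AdamW shrinks each weight proportionally. That mismatch is the coupling the paper sets out to remove.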

radek · November 17, 2017, 9:34pm

This is amazing! Great conversation ITT, and the paper @jeremy shared is off the scales!

rpathak · November 18, 2017, 4:26pm

I came across this recently published paper. I have yet to dig into it. Has anyone read it? What do the experts here say?

Backprop without Learning Rates Through Coin Betting
Francesco Orabona and Tatiana Tommasi
https://arxiv.org/abs/1705.07795

jeremy · November 18, 2017, 8:05pm

Their experiments are simply dreadful - the baselines they compare to are just so so bad. If I’m reading it correctly, 25% error for CIFAR10. My best model recently has been around 4% error.

yinterian · November 19, 2017, 1:27am

You can use Adam by passing it in the following way:

    learn = ConvLearner.pretrained(f_model, data, opt_fn=optim.Adam)

It works better for me as well.

jamesrequa · November 21, 2017, 6:20pm

Woah, I just switched to Adam and it works a lot better for me too, at least on the dataset I’m working with… The downside is that it’s definitely a lot more volatile than SGD. I guess in the end the choice of optimizer really depends on the data and the problem.

bremen79 · January 30, 2023, 10:39pm

I just randomly found this ancient thread: I am the first author of that paper.

I am a bit saddened by the snide remarks that we theory people get from applied deep learning people…

Instead, I think it would be great if applied people would start looking at parameter-free algorithms!

For example, despite the questionable comment on my experiments, my exact code above was used to win a Kaggle competition years ago, see GitHub - Arturus/kaggle-web-traffic: 1st place solution

Also, Facebook research people are currently looking at a variant of the same idea; see “Learning-Rate-Free Learning by D-Adaptation”, Aaron Defazio and Konstantin Mishchenko, arXiv 2023.