Does the learning rate finder work for Adam?


(hafez) #1

Does the learning rate finder technique work for more sophisticated optimizers like Adam?

What can I do to use the learning rate finder in my own PyTorch model?


#2

The answer to the first question, unless I am completely wrong, is that it should work regardless. The catch is that Adam relies more heavily on information accumulated from earlier iterations than SGD with momentum, which we seem to be using here. Still, momentum also accumulates across iterations, and it seems to work…
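
To make the comparison concrete, here is a minimal single-parameter sketch of the two update rules as they are usually written. The function names and the `state` dictionary are made up for illustration; the hyperparameter defaults are the commonly quoted ones, not values taken from the fastai library.

```python
import math


def sgd_momentum_step(param, grad, state, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update; the velocity carries past gradients forward."""
    state["v"] = momentum * state.get("v", 0.0) + grad
    return param - lr * state["v"]


def adam_step(param, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; keeps running averages of the gradient and its square."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    # Bias correction compensates for the averages starting at zero.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy usage: a few steps on f(w) = w**2, whose gradient is 2 * w.
w_sgd, w_adam = 2.0, 2.0
sgd_state, adam_state = {}, {}
for _ in range(5):
    w_sgd = sgd_momentum_step(w_sgd, 2 * w_sgd, sgd_state)
    w_adam = adam_step(w_adam, 2 * w_adam, adam_state)
```

Both optimizers carry state from one step to the next, so a learning rate sweep sees that history in either case; Adam additionally rescales each step by the running average of squared gradients.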

There is no simple answer to the second question, apart from looking at the fastai library and deciding which parts to reuse, or implementing this from scratch.
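
For the from-scratch route, a rough sketch in plain PyTorch might look like the following. This is not the fastai implementation: `lr_range_test`, its parameters, the smoothing constant, and the "stop when the loss exceeds 4x the best so far" rule are all assumptions chosen for illustration. The optimizer is passed in as a factory, so the same loop runs with Adam or SGD.

```python
import copy
import math

import torch
from torch import nn, optim


def lr_range_test(model, loss_fn, batches, opt_fn=optim.Adam,
                  start_lr=1e-5, end_lr=1.0, smoothing=0.98):
    """Raise the learning rate exponentially per mini-batch and record the loss.

    Hypothetical helper: returns (lrs, smoothed_losses) and restores the
    model's original weights when it finishes.
    """
    batches = list(batches)
    initial_state = copy.deepcopy(model.state_dict())
    optimizer = opt_fn(model.parameters(), lr=start_lr)
    # Per-step multiplier so the last batch is trained at end_lr.
    gamma = (end_lr / start_lr) ** (1.0 / max(len(batches) - 1, 1))
    lrs, losses = [], []
    avg_loss, best_loss, lr = 0.0, math.inf, start_lr
    for step, (x, y) in enumerate(batches, start=1):
        for group in optimizer.param_groups:
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # Bias-corrected exponential moving average of the loss.
        avg_loss = smoothing * avg_loss + (1 - smoothing) * loss.item()
        smoothed = avg_loss / (1 - smoothing ** step)
        # Stop once the loss is NaN/inf or has clearly blown up.
        if not math.isfinite(smoothed) or smoothed > 4 * best_loss:
            break
        best_loss = min(best_loss, smoothed)
        lrs.append(lr)
        losses.append(smoothed)
        lr *= gamma
    model.load_state_dict(initial_state)
    return lrs, losses


# Toy usage with random data; swap in your own model and data loader.
torch.manual_seed(0)
model = nn.Linear(10, 1)
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(50)]
lrs, losses = lr_range_test(model, nn.MSELoss(), data, opt_fn=optim.Adam)
```

Plotting `losses` against `lrs` on a log axis then gives the usual learning rate finder curve. Passing `opt_fn=optim.SGD` instead runs the same sweep with plain SGD.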


(Vitaly Bushaev) #3

The fast.ai library actually uses Adam behind the scenes for optimization. Jeremy mentioned it in the second lecture. So yes, it will work.


(Miguel Perez Michaus) #4

Yes, I also remember Jeremy saying that Adam is used behind the scenes. My question about Adam is now: can we actually choose not to use Adam and use plain vanilla SGD instead? If yes, how?


(Vitaly Bushaev) #6

Yes, if you are using ConvLearner.pretrained, you need to pass opt_fn=SGD. It is passed to the ConvLearner initializer, which in turn passes it on to the Learner initializer. I’ve not tried it though, so it might not work :slight_smile:
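
As a rough illustration of the idea, assuming `opt_fn` is simply a callable that takes the model parameters plus `lr` and returns an optimizer: `TinyLearner` below is a made-up stand-in, not the fastai `Learner`, and its fallback to SGD with momentum 0.9 is an assumption for the sketch.

```python
from functools import partial

from torch import nn, optim


class TinyLearner:
    """Hypothetical stand-in for a learner; not the fastai class."""

    def __init__(self, model, opt_fn=None):
        self.model = model
        # Fall back to SGD with momentum when no optimizer factory is given.
        self.opt_fn = opt_fn or partial(optim.SGD, momentum=0.9)

    def make_optimizer(self, lr):
        # The factory is only called once a learning rate is known.
        return self.opt_fn(self.model.parameters(), lr=lr)


plain_sgd = TinyLearner(nn.Linear(4, 1), opt_fn=optim.SGD).make_optimizer(lr=0.01)
default = TinyLearner(nn.Linear(4, 1)).make_optimizer(lr=0.01)
adam = TinyLearner(nn.Linear(4, 1), opt_fn=optim.Adam).make_optimizer(lr=1e-3)
```

With this pattern, switching optimizers means swapping the factory and nothing else in the training code.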


(Vitaly Bushaev) #7

It looks like the default is SGD_Momentum, or am I wrong? Here’s the code from learner.py:

    self.opt_fn = opt_fn or SGD_Momentum(0.9)

(Phani Srikanth) #8

Are you sure this is the case? This line suggests the default optimization function is SGD_Momentum.


(Miguel Perez Michaus) #9

I remember Jeremy mentioning it, haven’t analyzed the code. (beginner in Python, really limited :sweat:)

The comment about Adam being used is also in the lesson 2 lecture notes here (but it could be a mistake).

DeepLearning-LecNotes2



(Ramesh Sampath) #10

@mefmef -

Shameless plug - I used the learning rate finder on CIFAR10 with a custom PyTorch model -


(Jeremy Howard) #11

Huh - guess I misremembered. I think at some point I had Adam as the default, but it looks like I switched to SGD with momentum. How embarrassing!


(Jeff Lee) #12

I noticed that when I was writing lesson 1 in Keras. On the bright side, I think this is a great excuse to talk about Adam and other optimizers at some point :slight_smile:


(Miguel Perez Michaus) #13

Challenge accepted! :grinning:

I came across this while trying to better understand Adam. It looks like a serious flaw not only of Adam but of adaptive learning rate methods in general. @jeremy, maybe it is related to your decision to use SGD as the default? https://arxiv.org/abs/1705.08292
(I just discovered Adam, so I have no clue how big this problem is in practice.)


(Jeremy Howard) #14

Well spotted. However I find these theoretical papers using synthetic datasets nearly always useless in practice. A paper from a few days ago shows we can use Adam to get SoTA results: https://arxiv.org/abs/1711.05101 . I’d like to ideally implement that before I make Adam the default, since without it, Adam can give slightly sub-optimal results.


#15

This is amazing! Great conversation in this thread, and the paper @jeremy shared is off the scales! :slightly_smiling_face:


(Rahul Pathak) #16

I came across this recently published paper. I have yet to dig into it. Has anyone read it? What do the experts here say?

Backprop without Learning Rates Through Coin Betting
Francesco Orabona and Tatiana Tommasi
https://arxiv.org/abs/1705.07795


(Jeremy Howard) #17

Their experiments are simply dreadful - the baselines they compare to are just so so bad. If I’m reading it correctly, 25% error for CIFAR10. My best model recently has been around 4% error.


(yinterian) #18

You can use Adam by passing it in the following way:

    learn = ConvLearner.pretrained(f_model, data, opt_fn=optim.Adam)

It works better for me as well.


(James Requa) #19

Whoa, I just switched to Adam and it works a lot better for me too, at least on the dataset I’m working with… The downside is that it’s definitely a lot more volatile than SGD. I guess in the end the choice of optimizer really depends on the data and the problem.