Does the learning rate finder work for Adam?

Does the learning rate finder technique work for more sophisticated optimizers like Adam?

What can I do to use the learning rate finder in my own PyTorch model?

The answer to the first question - unless I am completely wrong - is that it should work regardless. The caveat is that Adam bases its updates on more information accumulated from earlier updates than the SGD with momentum we seem to be using here. Still, momentum also accumulates over time, and it seems to work…
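
For reference, this is the standard Adam update (from the original Adam paper), i.e. the state Adam accumulates across steps: $g_t$ is the gradient, $\eta$ the learning rate, and $\beta_1, \beta_2, \epsilon$ the usual Adam hyperparameters. Note that Adam keeps two exponential moving averages per parameter, whereas SGD with momentum keeps only the first:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$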

There is no simple answer to the second question, apart from looking at the fastai library and thinking about which parts to reuse, or implementing it from scratch - see the sketch below.
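
If it helps, here is a rough, minimal sketch of the idea for a plain PyTorch setup - not the fastai implementation, just the LR range test from the lecture: train for a number of mini-batches while growing the learning rate geometrically, record the loss at each step, and stop once the loss blows up. Names like `model`, `train_loader`, and `criterion` are placeholders for your own objects.

    import copy
    import math
    import torch

    def lr_find(model, train_loader, criterion, opt_class=torch.optim.SGD,
                start_lr=1e-7, end_lr=10, num_iters=100):
        model = copy.deepcopy(model)                    # work on a copy so the real model is untouched
        opt = opt_class(model.parameters(), lr=start_lr)
        gamma = (end_lr / start_lr) ** (1 / num_iters)  # multiplicative LR increase per step
        lr, lrs, losses = start_lr, [], []
        data_iter = iter(train_loader)
        for _ in range(num_iters):
            try:
                xb, yb = next(data_iter)
            except StopIteration:                       # restart the loader if we run out of batches
                data_iter = iter(train_loader)
                xb, yb = next(data_iter)
            opt.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            opt.step()
            lrs.append(lr)
            losses.append(loss.item())
            if math.isnan(losses[-1]) or losses[-1] > 4 * min(losses):
                break                                   # loss is diverging - stop the test
            lr *= gamma
            for group in opt.param_groups:              # set the increased learning rate
                group['lr'] = lr
        return lrs, losses                              # plot losses vs. lrs and pick an LR just before the minimum

To answer the first question with it, you would simply pass `opt_class=torch.optim.Adam`.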

The fast.ai library actually uses Adam behind the scenes for optimization. Jeremy mentioned it in the second lecture. So yes, it will work.

Yes, I also remember Jeremy saying that Adam is used behind the scenes. My question about Adam is now… can we actually choose not to use Adam and use plain vanilla SGD instead? If yes, how?

Yes, if you are using ConvLearner.pretrained, you need to pass opt_fn=SGD. It is then passed to the initialization of ConvLearner, and again to the initialization of the learner. Though I haven't tried it, so it might not work :slight_smile:
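
If anyone wants to try it, something along these lines should work (just a sketch, mirroring the `optim.Adam` example further down this thread; `f_model` and `data` are assumed to be defined as usual):

    from torch import optim

    # plain vanilla SGD instead of the default; the learner should construct the
    # optimizer itself from this, passing in the parameters and learning rate
    learn = ConvLearner.pretrained(f_model, data, opt_fn=optim.SGD)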

It looks like the default is SGD_Momentum, or am I wrong? Here's the code from learner.py:

    self.opt_fn = opt_fn or SGD_Momentum(0.9)
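
For anyone wondering what SGD_Momentum is: from memory it is roughly a small factory that fixes the momentum argument of torch.optim.SGD - an illustrative sketch, not the exact fastai source:

    import torch.optim as optim

    # illustrative only - not copied verbatim from the fastai library
    def SGD_Momentum(momentum):
        return lambda *args, **kwargs: optim.SGD(*args, momentum=momentum, **kwargs)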

Are you sure this is the case? This line suggests the default optimization function is SGD_Momentum.

I remember Jeremy mentioning it, but I haven't analyzed the code. (I'm a beginner in Python, so really limited :sweat:)

The comment about Adam being used is also in the lesson 2 lecture notes here (but it could be a mistake).

DeepLearning-LecNotes2

@mefmef -

Shameless plug - I used the learning rate finder on CIFAR10 with a custom PyTorch model -

Huh - guess I misremembered. I think at some point I had Adam as the default, but it looks like I switched to SGD with momentum. How embarrassing!

I noticed that when I was writing lesson 1 in Keras. On the bright side, I think this is a great excuse to talk about Adam and other optimizers at some point :slight_smile:

Challenge accepted! :grinning:

I came across this while trying to understand Adam better; it looks like a serious flaw not only of Adam but of adaptive learning rates in general. @jeremy, maybe this is related to your decision to use SGD as the default? https://arxiv.org/abs/1705.08292
(I just discovered Adam, so I have no clue how big this problem is in practice.)

Well spotted. However, I find these theoretical papers using synthetic datasets nearly always useless in practice. A paper from a few days ago shows we can use Adam to get SoTA results: https://arxiv.org/abs/1711.05101. Ideally I'd like to implement that before I make Adam the default, since without it Adam can give slightly sub-optimal results.
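
For anyone curious, the core fix in that paper - decoupled weight decay - is easy to sketch. This is just an illustration of the idea in plain PyTorch, not the paper's or fastai's exact implementation:

    import torch

    def step_with_decoupled_wd(model, optimizer, lr, wd):
        # decay the weights directly, instead of adding an L2 term to the loss
        # that Adam would then rescale with its per-parameter adaptive factor
        with torch.no_grad():
            for p in model.parameters():
                if p.requires_grad:
                    p.mul_(1 - lr * wd)
        optimizer.step()  # ordinary Adam step on the un-regularised gradients

The point is that the decay is applied uniformly to all weights instead of being scaled down for parameters with large gradient statistics, which is the problem the paper identifies with the usual L2-style weight decay in Adam.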

This is amazing! Great conversation ITT, and the paper @jeremy shared is off the scales! :slightly_smiling_face:

I came across this recently published paper. I've yet to dig into it. Has anyone read it? What do the experts here say?

Backprop without Learning Rates Through Coin Betting
Francesco Orabona and Tatiana Tommasi
https://arxiv.org/abs/1705.07795

Their experiments are simply dreadful - the baselines they compare against are just so, so bad. If I'm reading it correctly, they report 25% error on CIFAR10. My best model recently has been around 4% error.

You can use Adam by passing it in the following way:

    # optim here refers to torch.optim
    learn = ConvLearner.pretrained(f_model, data, opt_fn=optim.Adam)

It works better for me as well.

Whoa, I just switched to Adam and it works a lot better for me too, at least on the dataset I'm working with… the downside is that it's definitely a lot more volatile than SGD. I guess in the end it really depends on the data/problem which optimizer to choose.

I just randomly found this ancient thread: I am the first author of that paper.

I am a bit saddened by the snide remarks that we theory people get from applied deep learning people…

Instead, I think it would be great if applied people would start looking at parameter-free algorithms!

For example, despite the questionable comment on my experiments, my exact code above was used to win a Kaggle competition years ago, see GitHub - Arturus/kaggle-web-traffic: 1st place solution

Also, Facebook research people are currently looking at a variant of the same idea; see “Learning-Rate-Free Learning by D-Adaptation”, Aaron Defazio, Konstantin Mishchenko, arXiv 2023.