Lesson 1: Tradeoff between Learning Rate Finder quickness and optimal learning rate

bbli · February 6, 2018, 6:53am

Hi everyone,
After watching the first video lecture and working through the Jupyter notebook, I have a couple of questions:

1. This method of finding the “optimal” learning rate by running the model for a bit is different from the way I was taught to find values for hyperparameters. Namely, I would train the model all the way to completion on a variety of hyperparameter settings, and use the validation set to decide which model I would keep. Clearly, the method we used in lesson 1 is faster. So what are its disadvantages?
2. Building on question 1, when are you allowed to tune a hyperparameter by itself?
3. In section 9 of the notebook regarding the steps to train a world-class image classifier, why do we train first without the augmented data, and then with it? Can we just combine them?

radek · February 6, 2018, 9:32am

Ad #1 - training a deep learning model to completion can take weeks, so repeating full runs for every candidate setting gets expensive. Also, the ideal learning rate is very situation-dependent, meaning it is a piece of information that can be derived locally with this method and that has local significance.
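To make the "run the model for a bit" idea concrete, here is a minimal sketch of a learning rate range test on a toy one-parameter problem. This is not the library's implementation: the function names, the stopping threshold, and the "lowest-loss rate divided by 10" rule of thumb are all assumptions made for illustration.

```python
import math

def lr_range_test(loss_fn, grad_fn, w0, lr_start=1e-5, lr_end=10.0, num_steps=100):
    """Take one gradient step per learning rate, growing the rate exponentially."""
    growth = (lr_end / lr_start) ** (1.0 / (num_steps - 1))
    w, lr = w0, lr_start
    lrs, losses = [], []
    for _ in range(num_steps):
        loss = loss_fn(w)
        lrs.append(lr)
        losses.append(loss)
        # Stop once the loss blows up relative to the best value seen so far
        # (the factor 4 is an arbitrary choice for this sketch).
        if not math.isfinite(loss) or loss > 4 * min(losses):
            break
        w -= lr * grad_fn(w)
        lr *= growth
    return lrs, losses

def suggest_lr(lrs, losses):
    """Rule of thumb (an assumption here): one order of magnitude below the lowest-loss rate."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best] / 10

# Toy problem: minimise (w - 3)^2 starting from w = 0.
lrs, losses = lr_range_test(lambda w: (w - 3) ** 2, lambda w: 2 * (w - 3), w0=0.0)
print(f"stopped after {len(lrs)} steps, suggested lr ~ {suggest_lr(lrs, losses):.3f}")
```

The whole sweep costs a fraction of one training run, which is the speed advantage. The price is that the answer only reflects the model's state at the moment the sweep was run.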

You could test out various training regimes where, say, you start with some LR and decay it exponentially or whatnot, but if you are supervising the training manually you would probably be better off using the lr_finder.
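For reference, an exponentially decaying schedule can be sketched in a couple of lines. The starting rate of 0.1 and the decay factor of 0.9 below are made-up values for illustration, not recommendations.

```python
def exponential_decay(lr0, decay_rate, epoch):
    """Learning rate after `epoch` epochs: lr0 * decay_rate ** epoch."""
    return lr0 * decay_rate ** epoch

# Hypothetical settings: start at 0.1 and shrink by 10% every epoch.
schedule = [exponential_decay(0.1, 0.9, epoch) for epoch in range(5)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.5f}")
```

Note that this regime adds two hyperparameters of its own (the starting rate and the decay factor), which is part of why a quick range test is attractive when you are watching the training yourself.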

With regard to 2 and 3 (and somewhat 1), there are some best practices that seem to work, but you are free to try whatever seems to make sense to you. If you feel a grid search is applicable to your problem, give it a go.

The way I see it, the information is often presented in coherent chunks to teach us something. But there is no single recipe uniquely applicable. I would venture a guess that we do data augmentation down the road because this just makes it simpler to show us concepts one by one. Also, it is a bit more computationally expensive, since we need to run the images through all the layers vs saving the activations of the conv part of the network and training on them - quite a nice way of driving that point home!
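The cost difference can be sketched by counting forward passes through the frozen convolutional part. Everything below is a hypothetical stand-in (lists of numbers instead of images, a trivial function instead of a conv net), so treat it as an illustration of the bookkeeping only.

```python
import random

def conv_features(image, counter):
    """Hypothetical stand-in for the expensive, frozen convolutional layers."""
    counter["forward_passes"] += 1
    return [sum(image) / len(image), max(image)]

def augment(image, rng):
    """Hypothetical random augmentation: rescale brightness by a random factor."""
    scale = rng.uniform(0.8, 1.2)
    return [pixel * scale for pixel in image]

def count_passes_cached(images, epochs):
    counter = {"forward_passes": 0}
    cached = [conv_features(img, counter) for img in images]  # computed once
    for _ in range(epochs):
        for features in cached:
            pass  # the final layers would be trained on `features` here
    return counter["forward_passes"]

def count_passes_augmented(images, epochs, seed=0):
    counter = {"forward_passes": 0}
    rng = random.Random(seed)
    for _ in range(epochs):
        for img in images:
            # Each epoch sees a different random variant, so nothing can be cached.
            features = conv_features(augment(img, rng), counter)
    return counter["forward_passes"]

images = [[0.1, 0.5, 0.9], [0.2, 0.2, 0.7], [0.0, 0.3, 0.6], [0.4, 0.8, 1.0]]
print(count_passes_cached(images, epochs=3))     # 4: one pass per image
print(count_passes_augmented(images, epochs=3))  # 12: one pass per image per epoch
```

With saved activations the conv cost is paid once per image. With random augmentation it is paid once per image per epoch.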

As to at what point to use data augmentation, I do not know. I came across some people training a model up to some level and only then starting to use data augmentation. Probably training with it from the beginning would be okay as well. But as you progress in the course you will discover there are other considerations to take into account, for instance whether you are overfitting or not. This will be the driving factor in many regards.

bbli · February 7, 2018, 8:07am

Ok, thanks for the response. So I guess just try a basic setup, and if results aren’t good enough, then make adjustments?

punnerud · February 7, 2018, 8:27am

On the side. Could you change the title so it is more specific?
Suggestion: “Learning rate finder - Finder speed vs Optimal learning rate (Lesson 1)”

radek · February 7, 2018, 9:11am

Yes, pretty much. Also, as you continue to watch lectures you will start getting more of a feel for how you can approach things - Jeremy shares a lot of best practices.

bbli · February 8, 2018, 12:22am

@punnerud yeah, I can do that. So in the future, would it be better to separate my questions into different posts so the title for the post will be more specific?