[Paper discuss] A disciplined approach to neural network hyper-parameters by Leslie Smith

Hmm, this is interesting; I'd never thought about it. I guess he is talking about per-epoch time then. I like to think about these kinds of problems in budget terms: for a given problem and given hardware resources, what are the hyperparameters (from learning rate to network architecture) that give you the best performance in a fixed amount of time? In that formulation, learning rate doesn't change computational time, while batch size does (you can do more or fewer forward/backward passes in a given time).

Doesn't learning rate change computational time in terms of the budget? If my learning rate is very small, it might require me to run more epochs, so my p2 instance would need to run for a longer period of time. If it is large, the p2 instance might need to run for a shorter period. Am I wrong?

I meant that from the "budget point of view" you don't run the training until a certain performance is achieved (e.g. I want >90% accuracy, and that would require training the model for at least 21 hours), but vice versa: you have a fixed time interval, and you want to get the smallest validation loss during that interval (e.g. I only have 12 hours to spare, and the best accuracy I can achieve in them is 85%).
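To make that concrete, here is a minimal sketch of what training against a time budget might look like; `train_one_epoch` and `evaluate` are hypothetical placeholders for your own loop:

```python
import time

def train_with_time_budget(model, budget_seconds, train_one_epoch, evaluate):
    """Train until the time budget runs out; report the best validation loss seen."""
    start = time.time()
    best_val_loss = float("inf")
    while time.time() - start < budget_seconds:
        train_one_epoch(model)                               # one pass over the training data
        best_val_loss = min(best_val_loss, evaluate(model))  # current validation loss
    return best_val_loss

# e.g. "I only have 12 hours to spare":
# best = train_with_time_budget(model, 12 * 3600, train_one_epoch, evaluate)
```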

No, we're not actually overfitting unless the test loss is increasing. Having a better train loss than test loss doesn't mean you're overfitting. At the start of the graph you can see the test loss increase for a while.
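One way to turn that into a concrete check - a minimal sketch, assuming you collect the validation loss once per epoch - is to flag overfitting only when the validation loss stops improving, regardless of where the training loss sits:

```python
def should_stop(val_losses, patience=3):
    """Flag overfitting when validation loss has stopped improving,
    not merely when it is above the training loss."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    # No new best for `patience` epochs => validation loss is rising.
    return min(val_losses[-patience:]) > best

# Train loss could be far below all of these values the whole time;
# we only flag overfitting once the validation loss turns upward.
print(should_stop([0.9, 0.7, 0.6, 0.65, 0.7, 0.75]))  # True
```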

5 Likes

He means computational time per epoch. That could have been made clearer in the paper!

1 Like

Ok, that explains it. So far (since Part 1 in 2017) I thought that as soon as we see validation loss > train loss, we have started to overfit. That was confusing for me while watching the language model lessons too: I kept thinking we were always overfitting, so why were we continuing? That becomes a separate discussion, which I'll start after thinking about the lectures a bit.

1 Like

Just remember: our goal is a model that does as well on the test set as we can make it. There's no reason that this would happen at a point where the training set error is still higher than the test set error - in fact, that would be most unusual.

3 Likes

In Part 1, didn't we stop after the epoch where the training set loss became smaller than the validation set loss? At that point you mentioned that we were now overfitting; we did not stop at the point where the validation loss increased. Also, just to be precise: when we say overfitting in general, we mean overfitting on the training set, correct?

Question: did you mean the validation set in the above statement? Because on an ongoing basis we compare train with validation rather than with the test set; the test set is kept until the end, as per the separation of train/valid/test.

Another confusing thing in Leslie's paper (he does mention it explicitly, but it's still confusing) is that he keeps using "test" and "validation" interchangeably. If he is doing the hyperparameter tuning on a test set, then all of my understanding of train/test/validation is wrong. Or the paper is confusing.
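For reference, the conventional protocol being assumed here - tune hyperparameters against the validation set, touch the test set only once at the end - in a minimal sklearn sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)  # toy data

# 60/20/20 split: tune hyperparameters on (X_valid, y_valid);
# (X_test, y_test) stays untouched until all decisions are final.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
```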

2 Likes

No, I certainly hope I never gave that impression - I'm sorry if I did!

It’s pretty common - I do the same, unless it’s a case where it matters. Generally you can assume people are talking about a validation set, unless it’s obvious otherwise from the context.

3 Likes

I am watching Lesson 2 of the Intro to Machine Learning course. At https://youtu.be/blyXCk4sgEg?t=1434 you mention that we are overfitting when the score for the validation set is less than the score for the training set. So is the intuition for when we are overfitting different in deep learning compared to random forests?

Sounds like I may have misspoken :frowning: But yes, the intuition is very different - for DL, having a much lower training loss is pretty common.

3 Likes

I really enjoyed the paper. Wanted to share this brief interview with Leslie N Smith on his research.

15 Likes

Wow cool! How did that happen?

2 Likes

I occasionally interview researchers through my affiliation with MLConf. This time, I learned I could interview whomever I wished, and Leslie Smith was the first person to come to mind. Also, I had googled him, and there was limited info on his career, so I was interested in learning more. :slight_smile:

5 Likes

I thought this exact same thing. That’s good to know.

This is such a good paper. I love how practical and straightforward it is. I wish more papers had Remark sections! I completely missed this paper initially, but I will definitely be reading more of Leslie Smith's papers going forward. That interview was awesome as well. Nice job Reshama!

One takeaway I am seeing from the paper is about weight decay:

If you have no idea of a reasonable value for weight decay, test 10^-3, 10^-4, 10^-5, and 0. Smaller datasets and architectures seem to require larger values for weight decay while larger datasets and deeper architectures seem to require smaller values. Our hypothesis is that complex data provides its own regularization and other regularization should be reduced.
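If anyone wants to try that grid, here's a minimal PyTorch sketch of the sweep; `make_model` and `fit_and_score` are hypothetical stand-ins for your own model constructor and training loop:

```python
import torch

def sweep_weight_decay(make_model, fit_and_score, lr=0.1):
    """Try the weight decay values the paper suggests; return validation scores."""
    results = {}
    for wd in (1e-3, 1e-4, 1e-5, 0.0):
        model = make_model()                        # fresh model per run
        opt = torch.optim.SGD(model.parameters(), lr=lr,
                              momentum=0.9, weight_decay=wd)
        results[wd] = fit_and_score(model, opt)     # e.g. best validation loss
    return results
```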

4 Likes

This paper is a gem for the practitioner. Despite having already seen many of these suggestions in the fastai videos, it was good to understand a lot of the rationale behind why this or that works, and to give attention to some ideas that are less discussed in the videos (such as all the discussion about weight decay, as @KevinB said).

One question that came to my mind: if we should use the largest batch size our GPU memory allows, and given that the size of the network does not change during training, one could calculate the maximum (optimal) batch size from GPU memory, network size, and frozen layers, and never run out of memory. Right? :thinking:
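In practice an empirical probe may be easier than an analytical formula, since activation memory also depends on the input size and on which layers are frozen. A rough sketch that doubles the batch size until a forward/backward pass no longer fits:

```python
import torch

def find_max_batch_size(model, input_shape, device="cuda", start=2, limit=2**16):
    """Double the batch size until the GPU runs out of memory."""
    model = model.to(device)
    bs, largest_ok = start, None
    while bs <= limit:
        try:
            x = torch.randn(bs, *input_shape, device=device)
            model(x).sum().backward()   # the backward pass dominates memory use
            model.zero_grad()
            largest_ok = bs
            bs *= 2
        except RuntimeError:            # typically a CUDA out-of-memory error
            break
        finally:
            torch.cuda.empty_cache()
    return largest_ok
```

You'd probably want to leave some headroom below whatever this returns, since memory use can still vary a bit between batches.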

I’ve been wondering about optimum batch size as well. Here’s a paper I just came across on how batch size affects generalization: https://arxiv.org/pdf/1705.08741.pdf

My understanding after a first read-through is that larger batches can result in worse generalization because there are fewer gradient updates per epoch. The paper makes the case for adjusting the number of training epochs by the batch size - basically measuring the length of training by the number of weight updates rather than the number of epochs.
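The bookkeeping for that is simple: holding the number of weight updates fixed makes the epoch count scale linearly with batch size. A toy calculation:

```python
def epochs_for_fixed_updates(n_updates, dataset_size, batch_size):
    """Epochs needed so that the total number of weight updates stays constant."""
    updates_per_epoch = dataset_size / batch_size
    return n_updates / updates_per_epoch

# A CIFAR-10 sized dataset (50k examples): the same 10k weight updates means
print(epochs_for_fixed_updates(10_000, 50_000, 128))  # 25.6 epochs at batch size 128
print(epochs_for_fixed_updates(10_000, 50_000, 512))  # 102.4 epochs at batch size 512
```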

1 Like

Leslie’s paper on batch size:

Small batch sizes have been recommended for regularization effects (Wilson & Martinez, 2003) and others have shown there to be an optimal batch size on the order of 80 for Cifar-10 (Smith & Le, 2017). Contrary to this early work, this Section recommends using a larger batch size when using the 1cycle learning rate schedule, which is described in the above.

What I understood is that, although batch size can be used for regularization, using a large learning rate is a better regularization method, since big learning rates in 1cycle help you achieve convergence faster, while small batch sizes have the opposite effect (they slow it down).

My point is that, if we are to use 1cycle, why not build a function that gives us the optimum batch size (the biggest that fits in GPU memory) automatically? This would prevent a lot of trial and error during training.
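For illustration, pairing a big batch (via the loader) with a 1cycle schedule might look like this in plain PyTorch, using torch.optim.lr_scheduler.OneCycleLR; the model, loader, and peak learning rate here are just placeholder assumptions:

```python
import torch

def train_1cycle(model, train_loader, epochs=10, max_lr=1.0, wd=1e-4):
    """One 1cycle run: large batch from the loader plus a large peak LR."""
    opt = torch.optim.SGD(model.parameters(), lr=max_lr / 10,
                          momentum=0.9, weight_decay=wd)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
            sched.step()                # the LR moves every batch, not every epoch
```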

2 Likes

Do the findings in Leslie's paper apply to architectures other than convolutional networks? For example, is the recipe the same for a model like the Transformer?