Fastbook Chapter 6: overfitting to the validation set

Hi all,
I am a little confused by the following section of Chapter 6 of fastbook.

In this case, we’re using the validation set to pick a hyperparameter (the threshold), which is the purpose of the validation set. But sometimes students have expressed their concern that we might be overfitting to the validation set, since we’re trying lots of values to see which is the best. However, as you see in the plot, changing the threshold in this case results in a smooth curve, so we’re clearly not picking some inappropriate outlier. This is a good example of where you have to be careful of the difference between theory (don’t try lots of hyperparameter values or you might overfit the validation set) versus practice (if the relationship is smooth, then it’s fine to do this).

I must admit I am one of the aforementioned concerned students :smile:
I have thought about the paragraph for a little while, but I cannot wrap my head around it.
How is "smoothness of the relationship" related to "it is OK to tune hyperparams on the validation set"?

I understand that if the metric-vs-hyperparam curve is smooth, then by definition we don't pick "some inappropriate outlier".
But even so, we are still peeking into the “future” (the validation set is assumed to be an unbiased representation of what the real world looks like) and tweaking our “past” (the model) to better align with it. Am I wrong?
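For reference, the sweep from the chapter goes along these lines (just a sketch: stand-in random tensors are used here so it runs on its own, while in the book preds and targs come from learn.get_preds() on the validation set):

```python
import torch
import matplotlib.pyplot as plt
from fastai.metrics import accuracy_multi

# Stand-ins for the real thing: in the chapter, preds and targs come from
# learn.get_preds() on the validation set of a multi-label model.
preds = torch.rand(100, 20)                    # predicted probabilities
targs = (torch.rand(100, 20) > 0.7).float()    # binary multi-label targets

xs = torch.linspace(0.05, 0.95, 29)            # candidate thresholds
accs = torch.stack([accuracy_multi(preds, targs, thresh=t, sigmoid=False) for t in xs])

plt.plot(xs, accs)        # the book's point: this curve is smooth,
plt.xlabel("threshold")   # so the chosen threshold is not an outlier
plt.ylabel("accuracy")
plt.show()
```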

Thanks, and happy hacking!

2 Likes

I second that concern and was also irritated by the statement.

In my opinion there is no way around this kind of “overfitting on the validation set”. However, it can be somewhat countered by cross-validating, and the final model should be verified on a separate holdout set (the difficult part is resisting the temptation to then overfit to the results on the holdout set).
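Something along these lines is what I have in mind (a minimal sketch with scikit-learn; the estimator, parameter grid and data are just placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Placeholder data; in practice X, y are your own features and labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Keep a holdout set that is never touched during tuning.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune hyperparameters by cross-validation on the training portion only.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [40, 50], "max_depth": [None, 10]},
    cv=5,
)
grid.fit(X_train, y_train)

# One final check on the holdout set -- and then resist re-tuning against it.
print(grid.best_params_, grid.score(X_hold, y_hold))
```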

1 Like

Not sure I understand the concern: the main point of having a validation set is to use it to pick your hyperparameters. In my mind the term “validation set” implies you have three sets: training, validation, and test. You should never just have training and validation.

If you only had two sets, you would call them “training” and “test”.

@wdhorton thanks for your comments!
I fully agree with you that, in the scenario you describe, i.e. the ideal one with 3 separate sets (training, validation and test), my concern does not hold, because the validation set is explicitly meant for hyperparam tuning.

However, I think that fastai uses the Kaggle standards/nomenclature, i.e. a training set for training, a validation set for validating (not for model tuning), and a test set that is unlabeled and therefore useless for any kind of performance check.
Now, you might argue that, assuming what I have just said is true, any model tweak which leads to better performance on the validation set (as per the fastai definition) is questionable, as it involves leakage :slight_smile: (not necessarily overfitting, though).

Have I been able to explain myself?

I often work with small datasets, so I end up having just 2 sets (training and test) and performing my hyperparam tuning via cross-validation on the training set.

@sgugger is my understanding correct?

fastai2 allows for labelled test sets natively. It used to happen automatically, but now you need to pass is_labeled=True (IIRC) to your test_dl.

Also, I invite you to read Rachel’s “How (and why) to create a good validation set” here.

A quote:

The underlying idea is that:

  • the training set is used to train a given model
  • the validation set is used to choose between models (for instance, does a random forest or a neural net work better for your problem? do you want a random forest with 40 trees or 50 trees?)
  • the test set tells you how you’ve done. If you’ve tried out a lot of different models, you may get one that does well on your validation set just by chance, and having a test set helps make sure that is not the case.

A key property of the validation and test sets is that they must be representative of the new data you will see in the future. This may sound like an impossible order! By definition, you haven’t seen this data yet. But there are still a few things you know about it.

1 Like

@muellerzr Thanks for the reply. I did not know fastai2 allowed for labelled test sets natively and that’s great!

I am indeed familiar with Rachel’s article and I do understand the importance of 3 separate sets.
My point remains the same though.
I cannot really understand this sentence from the book:

Specifically: “if the relationship is smooth, then it’s fine to do this”.

1 Like

I actually wanted to ask something else, very related to this thread.
I hope this doesn’t sound too silly!

In almost any notebook Jeremy publishes, regardless of the DL application, he generally starts with a simple baseline model and then he improves on it.
At the end of the process, he gets to a very high accuracy (I just picked a metric as an example) and he claims: “we reached XX% accuracy on this task”.
In some cases, such as ULMFiT, it is even SOTA.

Now, isn’t this ill-posed by definition?
I mean, “we reached XX% accuracy on this task” is true on the validation set, not on a holdout test set. The claim might well turn out to be false if we checked performance on the latter.

Does what I am saying make sense?
Thanks all for the stimulating discussion!

I think the single most important thing to take into consideration is the robustness of the model, and by that I mean how much the results change based on the chosen hyperparameters.

Back in the early days (and by early days I mean just a few years ago), hyperparameter tuning greatly impacted the observed quality of your model, e.g. changing the depth of your tree classifier would totally change the results you get, so the concern about overfitting hyperparameters was very real.

Over the years DL has become a very robust technique. This is easily seen with fastai: running learn.fine_tune can give you excellent results on almost any task, without changing any hyperparameter!
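For example, the quick-start “cat vs dog” example from the course reaches a very low error rate with nothing but defaults (a sketch, assuming the standard PETS dataset download works in your environment):

```python
from fastai.vision.all import *

# Everything on defaults: default learning rate, default one-cycle schedule,
# default augmentation -- no hyperparameter fiddling at all.
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()   # in this dataset, cat filenames are capitalised

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```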

There is also another view on this: take, for example, the top models on ImageNet. They all score very, very similarly! Some of the architectures used there are completely different from one another, but they achieve similar results.

Robustness is the key here. Have you noticed how hard it is to actually overfit a model these days? I remember, when I did the first version of the course (in Keras), how my models were always overfitting. I remember spending days trying different dropout values and other regularization techniques. But this time around, I have not had to fiddle with that directly even once.

And this goes hand in hand with what Jeremy described as the “smoothness of the curve”: he’s saying that changing the hyperparameters does not wildly change our results, so there is no magic combination that gives us a magical 10% increase in our score. So it’s very unlikely that we’re going to find hyperparameters that only work well on the validation set and perform poorly on real data.

Being robust means that even if we overfit our hyperparameters to our validation set, it’s not going to be a big deal; we are actually almost incapable of heavily overfitting hyperparameters. (To be clear, this is only true if our training pipeline is actually robust, and that comes from a combination of techniques and good practices that fastai implements: the one-cycle policy, differential learning rates, heavy data augmentation, good model architectures, good initialisation, and everything else happening behind the scenes.) If you remove all of this infrastructure, everything falls apart.

I actually believe in a future (and I think Jeremy has a similar view) where we will not even need to change a single hyperparameter. This is the super-convergence era: everything will just work out of the box, and fastai is rapidly moving toward that future. It’s going to be amazing.

8 Likes

It’s just very, very hard (I dare say impossible) to have a completely neutral dataset for research, because you need to compare results, which means you have to train different models and evaluate them on the same dataset.

The only real way of having completely unbiased results is the way Kaggle does it: don’t show the score until the competition ends. And we do see entries that were at the top fall to the bottom because they were overfitting. So yes, saying “we reached XX% accuracy” is not completely correct.

But the way Kaggle does it is not feasible for research; it would mean that each time we look at our test set and make a change to our model, we would need to create a new test set.

1 Like

I completely agree with every single word of yours.
It’s really helpful to have these discussions!