About Overfitting in DL (indulge me..)

I assume you all already know what Jeremy and many other prominent scholars teach us about overfitting in neural networks. It may be summarized as follows: as long as your accuracy keeps improving, don’t worry about your training loss (TL) dropping below (even significantly below) your validation loss (VL). See the long version (*) below for a more in-depth discussion if you forgot the lesson.

Of course, I strongly believe this is true. But let us assume that we have two models trained over the same data, achieving the same level of accuracy/error rate.

One model, though, has its TL still above the VL (or below it, but not by much).
The second model, on the other hand, has a TL significantly below the VL, or even close to zero. Maybe because we used a bigger network, maybe because we meddled with dropout and the like, maybe because we kept training a bit longer than necessary.

What I want to ask is: will the first model have a better generalization capacity?

For example, if we use that model in production on data whose underlying distribution is quite different from that of the train/validation data, will it perform better than our second model?

I’m asking this since I worked on “real” data with that philosophy in mind, achieving awesome accuracies with models whose TL was far below the VL. When the model was used on test images in the same domain, but with different shooting conditions, that accuracy worsened a lot.

(*) Long version:

So the only thing that tells you that you’re overfitting is that the error rate improves for a while and then starts getting worse again. You will see a lot of people, even people that claim to understand machine learning, tell you that if your training loss is lower than your validation loss, then you are overfitting. As you will learn today in more detail and during the rest of the course, that is absolutely not true.
Any model that is trained correctly will always have train loss lower than validation loss.
That is not a sign of overfitting. That is not a sign you’ve done something wrong. That is a sign you have done something right. The sign that you’re overfitting is that your error starts getting worse, because that’s what you care about. You want your model to have a low error. So as long as you’re training and your model error is improving, you’re not overfitting.
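To make the criterion in the quote concrete, here is a minimal sketch that looks only at the per-epoch error rate and ignores the gap between TL and VL. The function name `overfitting_onset`, the `patience` parameter, and the example numbers are all assumptions made for illustration; they are not from the course or from fastai.

```python
def overfitting_onset(error_rates, patience=2):
    """Return the index of the best epoch if the validation error rate has
    failed to improve for `patience` consecutive epochs after it, else None.

    `patience` is a hypothetical knob: it only guards against calling a
    single noisy epoch "overfitting".
    """
    best_epoch = 0
    for epoch, err in enumerate(error_rates):
        if err < error_rates[best_epoch]:
            best_epoch = epoch  # still improving
        elif epoch - best_epoch >= patience:
            return best_epoch   # error stopped improving after this epoch
    return None


# Made-up validation error rates, one value per epoch.
still_improving = [0.30, 0.20, 0.10]
got_worse = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21]

print(overfitting_onset(still_improving))
print(overfitting_onset(got_worse))
```

Note that the training loss never enters the check: under this reading, a TL far below the VL is not by itself evidence of overfitting.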

Interesting question! Probably the fast.ai way is to try both of them and either ensemble them or choose the one that performs better on your test data (assuming that your test distribution is different from train and valid; otherwise both models should perform well).
…In the end you have to guess the test distribution!
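As a rough sketch of the "try both, then ensemble or pick one" idea: given each model's predicted class probabilities on a labelled sample from the test distribution, you can compare their accuracies and also score a simple averaged ensemble. The helper names and the toy numbers below are assumptions for illustration only, and averaging probabilities is just one possible way to ensemble.

```python
def accuracy(probs, labels):
    """Fraction of rows whose highest-probability class matches the label."""
    correct = 0
    for row, label in zip(probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += predicted == label
    return correct / len(labels)


def average_ensemble(probs_a, probs_b):
    """Element-wise mean of two models' predicted probabilities."""
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(probs_a, probs_b)
    ]


# Toy predictions from two hypothetical models on three test-distribution samples.
labels = [0, 1, 1]
probs_a = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
probs_b = [[0.4, 0.6], [0.1, 0.9], [0.3, 0.7]]
probs_ens = average_ensemble(probs_a, probs_b)

for name, probs in [("A", probs_a), ("B", probs_b), ("A+B", probs_ens)]:
    print(name, accuracy(probs, labels))
```

Whichever candidate wins here should only be trusted to the extent that the labelled sample really resembles the production data.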

BTW to me this is Jeremy’s best quote of DL3-Part1 :wink:


Thanks for your reply @ste! Yes, it is the most reasonable approach. Still, I got discrepancies among my experiments, and was unable to draw solid conclusions… :confused:

I think the problem is the following:

Your train/valid and your test set come from different distributions. That means the validation set does not help you in that case to judge overfitting in relation to your test data (distribution).

Hence, you should split your data as follows:

Train / Train Dev (validation data from the training distribution, used during development)

Test Dev (validation data from the test distribution, used during development) / Test

Then you can be sure that you have one validation set drawn from your training data distribution and another drawn from your test data distribution.
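A minimal sketch of that four-way split, assuming you have one pool of items from the training distribution and a (usually smaller) labelled pool from the test distribution. The function name `four_way_split`, the `dev_fraction` default, and the integer stand-ins for images are hypothetical.

```python
import random


def four_way_split(source_items, target_items, dev_fraction=0.2, seed=0):
    """Split two pools into train / train_dev and test_dev / test.

    source_items: samples from the training distribution.
    target_items: samples from the test (production) distribution.
    """
    rng = random.Random(seed)

    def split(items):
        items = list(items)
        rng.shuffle(items)
        n_dev = max(1, int(round(len(items) * dev_fraction)))
        return items[n_dev:], items[:n_dev]  # (main part, dev part)

    train, train_dev = split(source_items)
    test, test_dev = split(target_items)
    return {"train": train, "train_dev": train_dev,
            "test_dev": test_dev, "test": test}


# Integers stand in for image paths here.
splits = four_way_split(range(100), range(100, 150))
print({name: len(items) for name, items in splits.items()})
```

Comparing the error on `train_dev` with the error on `test_dev` then separates ordinary overfitting from the drop caused by the distribution shift.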


It seems to be a good idea. Will do a bit of experimenting and let you know. Thanks!

Your summary is very useful.