What's a well-fitted model? Train and validation error


Hi there,
Let’s say I have split my dataset in three: 60% train, 20% val, 20% test.

An overfitted model has a validation error higher than its train error.
One of the usual “cures” in that case is to add, for example, dropout.
That will usually raise the train error (calculated with dropout) and, after a few iterations, lower the validation error.

My question is general: to have a well fitted model, should I aim for a train error (calculated with dropout) nearly equal to the validation error? Or a train error (calculated without dropout) nearly equal to the validation error?

From what I have seen before, I would opt for answer 1. Then I can stop when the training error (calculated with dropout) is for example 0.22 and the validation error 0.23. The problem is: for my final well fitted model (that is, with the calculation made with no dropout), the errors on validation and on train are in that case very different (e.g. on dog breeds: 0.08 (train), 0.23 (val)).
Is that normal? Is my model really a good fit?

Thanks a lot for your help, “fasters” !

(RobG) #2

You shouldn’t see significant difference in test using weights that have been trained with or without dropout, assuming test/inference data is similar to the train data. Dropout is but one technique that can allow more rapid training toward weights which are suitable for inference.


Thanks for your answer Rob,
I think I did not explain myself well enough.

You’re saying “Dropout is but one technique that can allow more rapid training toward weights which are suitable for inference.”
True indeed, that’s my aim.
And usually, the way to verify that my inferences will be good is to check that the train and validation errors are nearly the same for my model.
My question is: in that check, should it be the train error calculated with the model where dropout is actually applied to the activations in the feedforward pass, or the train error calculated with the model where I disable dropout? That is, using the “test time” version of the model, if you wish.
(Note: I’m not talking about the training itself, where I do apply dropout on each pass. I’m talking about the final calculation of the train error, where I want to compare it to the validation error, which, of course, is calculated with dropout disabled.)

Generally speaking : we’re calculating the validation error with dropout disabled. Shouldn’t we therefore compare it to the train error calculated with dropout disabled ?
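
To make the two ways of computing the train error concrete, here is a minimal sketch in plain Python. It is not fast.ai code: the toy one-layer model, the weights, the data and all function names are assumptions made up for illustration. It uses inverted dropout (survivors rescaled by 1/(1-p)) and evaluates the same weights on the same training data, once with dropout active and once with it disabled.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: while training, zero each unit with probability p
    and rescale the survivors by 1/(1-p); at evaluation time, do nothing."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def predict(x, weights, p, training, rng):
    """Hypothetical toy model: dropout on the inputs, then a weighted sum."""
    dropped = dropout(x, p, training, rng)
    return sum(w * a for w, a in zip(weights, dropped))

def mse(data, weights, p, training, rng):
    """Mean squared error over (x, y) pairs, in the chosen mode."""
    errors = [(predict(x, weights, p, training, rng) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)

rng = random.Random(0)
weights = [0.5, -1.0, 2.0]  # made-up weights that fit the toy data exactly
train_set = [([1.0, 2.0, 3.0], 4.5), ([0.0, 1.0, 1.0], 1.0)]

# Same weights, same data, two different numbers:
loss_with_dropout = mse(train_set, weights, p=0.5, training=True, rng=rng)
loss_without_dropout = mse(train_set, weights, p=0.5, training=False, rng=rng)
print(loss_with_dropout, loss_without_dropout)
```

In this toy case the loss without dropout is exactly zero while the loss with dropout is not, so the two "train errors" describe different forward passes even though the weights are identical. In PyTorch-based code the same switch is usually made with the module's train() and eval() modes.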


Here is a detailed example.
At the end of training a ResNet-34 with dropout on the last layers on the dog breeds dataset, I have the following:

  • test_error : test loss (on kaggle) (test-time data augmentation, no dropout) : 0.20724
  • val_error_da_nd : val loss (test-time data augmentation, no dropout) : 0.20363873946921965
  • val_error_nda_nd: val loss (no test-time data augmentation, no dropout) : 0.20163 (lower than with augmentation: weird, but OK, maybe an unlikely event that happened here due to randomness)
  • train_error_da_d : final train loss (training-time data aug., with dropout) : 0.195593
  • train_error_da_nd : final train loss (training-time data aug., without dropout) : 0.07514032032330985 (I managed to calculate this one by digging into the fast.ai library's predict_with_targs() function.)

I’ll make the hypothesis here that training-time augmentation and test-time augmentation do not change the results much (which is supported by the rather low spread between val_error_da_nd and val_error_nda_nd).

From there, we see that without dropout, the spread between the error on the train set and the error on the validation set is high (roughly 2.7x between the two).
Does my model overfit? Unsure. I would say it does not, as the test and validation errors are very close to each other. Also, the loss calculated on train with dropout is very near these two.
Therefore the answer would be: “well fit model” = when train loss with dropout nearly equals val loss without dropout, which nearly equals test loss without dropout. Is that it?
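
As a quick check of the spreads discussed above, the ratios can be computed directly from the losses listed in the example. The variable names below are just shorthand for the bullet labels; nothing here is fast.ai API.

```python
# Losses copied from the list above.
test_loss = 0.20724                  # test_error
val_nd = 0.20363873946921965         # val_error_da_nd
train_d = 0.195593                   # train_error_da_d (dropout on)
train_nd = 0.07514032032330985       # train_error_da_nd (dropout off)

ratio_without_dropout = val_nd / train_nd   # about 2.7
ratio_with_dropout = val_nd / train_d       # about 1.04
val_test_gap = test_loss - val_nd           # about 0.0036

print(round(ratio_without_dropout, 2), round(ratio_with_dropout, 2), round(val_test_gap, 4))
```

So the validation loss is about 2.7 times the train loss when both are computed without dropout, but only about 4% above the train loss computed with dropout, which is why the two comparisons suggest different conclusions.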

@jeremy: help very much appreciated :)

Edit: Actually, after a couple of hours of thinking, I think the answer is “well fit model” = when train loss without dropout nearly equals val loss without dropout, which nearly equals test loss without dropout. We should not be comparing errors that are computed in different ways (sometimes with dropout, sometimes without). Hence, in the end, I think there is still some overfitting in my results above, so the model could still be tuned some more by adding more data or more dropout.

(RobG) #6

Your edit is correct: dropout has nothing to do with it. We use various hyperparameters to train as fast as possible, and to reach the best minimum we can find. If you got there with or without dropout, fabulous.

What matters is that toward the end of training, you did not cross the Rubicon of overfitting. This is when validation loss is no longer decreasing much (or at all), but training loss still is, and it is below validation loss. With fastai and images, this usually happens around the same time (i.e. when train loss drops below val loss), but with language models you can have a lower train loss than val loss while both keep falling, so we are not overfitting until val loss slows down.
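
The rule of thumb above could be sketched as a small check over per-epoch losses. This is only an illustration in plain Python: the function name, the `patience` and `min_delta` parameters and the loss curves are made up, and fastai does not necessarily do it this way.

```python
def first_overfitting_epoch(train_losses, val_losses, patience=2, min_delta=0.0):
    """Return the first epoch index where validation loss has not improved for
    `patience` epochs in a row while training loss is still falling and is
    below validation loss. Return None if that never happens."""
    best_val = float("inf")
    stalled = 0
    for epoch, (train, val) in enumerate(zip(train_losses, val_losses)):
        if val < best_val - min_delta:
            best_val = val
            stalled = 0
        else:
            stalled += 1
        train_still_falling = epoch > 0 and train < train_losses[epoch - 1]
        if stalled >= patience and train_still_falling and train < val:
            return epoch
    return None

# Image-like case: val loss flattens after epoch 3 while train keeps falling.
train = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
val = [0.95, 0.70, 0.50, 0.42, 0.43, 0.44, 0.46]
print(first_overfitting_epoch(train, val))  # -> 5

# Language-like case: train is below val, but both keep falling.
train_lm = [1.0, 0.8, 0.6, 0.5]
val_lm = [1.2, 1.0, 0.9, 0.85]
print(first_overfitting_epoch(train_lm, val_lm))  # -> None
```

Note that the check only looks at how the two curves move over time, not at whether they are equal, which matches the point that a train loss below the val loss is not overfitting by itself.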

If you’re in a Kaggle competition you can see this with your score. If not, you can hold some data back as ‘test’.


Thanks for your reply rob,
But again, I think I might not have explained myself well.
Suppose I’m training a model with dropout.
You’re saying : «stop the training when train error falls while val error does not».
Yes, but a train error calculated how ?
There are two ways to calculate that train error, either by activating dropout, or by not activating it in the feedforward pass. (Model has dropout during training, but when calculating its error we could deactivate that option)
Looking at the source code it seems to me that the train error is calculated with dropout activated.
I think, in my understanding, that this train error should be calculated with dropout disabled so that it gets compared to val error which is always calculated with dropout disabled.


@sgugger I saw that you are a major contributor to the library, so maybe you can help me with this. I’ve been stuck on it for a few days.
As I understand the courses, we are running gradient descent until we find a model where : 1/ training error is low (–>low bias), 2/ validation error (as displayed by the fit method) nearly equals train error (as displayed by the fit method too). (–>low variance)
But in that fit method, the train error is calculated with dropout activated, while validation error is calculated with dropout deactivated.
Shouldn’t a low variance model be a model where the error on the train and error on validation are the same, but when calculated the same way, that is, with dropout deactivated in both cases ?


The ultimate goal is to find a model with the best validation metric (or error but a metric like accuracy is better). Training loss matters little, usually it’s very close to zero at the end of training, and if you get catastrophic overfitting (when your training loss is getting lower and lower while your validation loss gets higher and higher) you really won’t care about it.

We don’t want validation loss to equal training loss, we want it as low as possible (so often that means it must be close to the training loss, yes, but that’s not the original goal). If you don’t regularize your network at all during training, your training loss will go to zero and the validation loss will be completely random. To prevent this there are a few regularization techniques, among them dropout.

All regularization techniques make it harder for the model to get the right output and dropout is no exception, hence a higher training loss (but we don’t care). What we do care about is that the good regularization techniques will help the model to generalize better, and even if your training loss won’t achieve the same minimum at the end, it will be a better minimum in terms of generalization, which should get you a better validation/test loss. Of course, when validating the model or using it in production afterward, we don’t want to apply those techniques again (except for TTA but it’s a different story).

You shouldn’t do the comparison between train/valid loss with or without dropout. What you should compare is the value of your validation loss at the end of a training run without dropout, and at the end of a training run with dropout. You should see a better value with dropout (otherwise, you may have used too much dropout).


Thanks for the detailed answer.
I think a missing word might have changed the meaning of your sentence.
Do you mean “unless you get catastrophic overfitting, you really won’t care about it.”?

Thanks a lot for the great library !


No I meant that if you get catastrophic overfitting, you won’t care that your training loss is getting to zero.


Ok, got it.

The thing is, how do you make sure that ultimately, validation error will be close to test error?

All these questions come from the fact that, on a model I ran a couple of weeks ago, I managed after running the optimization multiple times to find a model which had a very low training error and a rather low validation error, but when I checked the test error, it was very bad.
I think that by running the optimization that many times and keeping only the best model overall, I might have overfitted the validation set somehow, by finding a rather peculiar model that was capable of explaining both the train and val sets, but not the test set.
I had a very small time series dataset: 80 points that I cut in 60,20,20. The model had around 3 times more parameters than the dataset had points, but this seems OK from what I have seen in the fast.ai courses.
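
A small simulation can illustrate how keeping the best of many runs overfits a tiny validation set. Everything below is a made-up toy, not the model described above: each "model" has no real skill (every prediction is right with probability 0.5), and the validation and test sets are assumed to hold 16 points each (20% of 80).

```python
import random

def accuracy_of_unskilled_model(n_points, rng):
    """Accuracy of a model whose predictions are each right with probability 0.5."""
    return sum(rng.random() < 0.5 for _ in range(n_points)) / n_points

def select_best_on_validation(n_models, n_val, n_test, seed=0):
    """Score many unskilled models and keep the one with the best validation
    accuracy. Returns (best val accuracy, its test accuracy, mean val accuracy)."""
    rng = random.Random(seed)
    results = [
        (accuracy_of_unskilled_model(n_val, rng), accuracy_of_unskilled_model(n_test, rng))
        for _ in range(n_models)
    ]
    best_val, test_of_best = max(results, key=lambda r: r[0])
    mean_val = sum(v for v, _ in results) / n_models
    return best_val, test_of_best, mean_val

best_val, test_of_best, mean_val = select_best_on_validation(n_models=200, n_val=16, n_test=16)
print(best_val, test_of_best, mean_val)
```

The selected model's validation accuracy is well above chance purely because it was picked for that, while its test accuracy has no reason to be: it is an independent draw centred on 0.5. The smaller the validation set and the more runs you compare, the larger this optimistic bias gets.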

This is why I thought seeking close train & validation error might have prevented that ultimate spread between validation and test error.


You do need to be really careful about how you go about constructing your validation (and test) sets, particularly for time series. In case you haven’t seen it before, Rachel Thomas wrote a great blog post about this issue last year: http://www.fast.ai/2017/11/13/validation-sets/
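
For a time series, one common way to build the sets is to split chronologically rather than at random, so that validation and test data come after the training data in time. Here is a minimal sketch; the function name and the 60/20/20 fractions applied to 80 points are assumptions based on the numbers mentioned earlier in the thread.

```python
def chronological_split(series, train_frac=0.6, val_frac=0.2):
    """Split a time-ordered sequence into train/val/test without shuffling:
    the oldest points train the model, the most recent ones test it."""
    n = len(series)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]

series = list(range(80))  # stand-in for 80 time-ordered observations
train, val, test = chronological_split(series)
print(len(train), len(val), len(test))  # -> 48 16 16
```

With a random split, points from the same period end up in both train and validation, which can make the validation error look better than what you would get on genuinely future data.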

Also, if you only have 80 data points in your time series, you may well find that a deep learning approach is a bit like using a sledgehammer to try and crack a nut.


Yes, data is unfortunately usually pretty sparse in industry.

I also had that feeling that the set was small, but did not see real «concrete» reasons why a NN would not be a good choice. What is the rock solid reason that a NN would need more data than any other machine learning model ?