Dropout hides the actual overfit

It’s well known that a dropout layer reduces overfitting in a neural network. But sometimes dropout has a non-obvious negative effect: it can hide the fact that the network is overfitting.

The background of the issue is as follows:

  • I use fastai to build a deep fully connected neural network.
  • I use dropout to prevent overfitting.
  • One of the layers in my network has 7 neurons and 0.5 dropout.
  • The validation mechanism in my task requires me to keep training loss and validation loss about the same. If training loss is much lower than validation loss, I get a huge error on the test set.

So when I train the network, at some point I get a good-looking result: train loss 1.62, validation loss 1.58. But I feel like the model is overfitted, and inference on the test set says the same. Finally I found the issue. In fastai, the train loss is calculated during the training process, i.e. with dropout applied, so the train loss is high because of the dropout. Validation loss is calculated without dropout. The actual train loss (without dropout) is 0.98.

The issue is caused by the fact that I use high dropout (0.5) on a small layer (7 neurons). With 20 neurons, the difference between the train loss with dropout and the actual train loss becomes significantly smaller.
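To illustrate why the layer width matters, here is a rough sketch in plain Python. It is a toy setup and not the network from this post: every unit outputs 1.0, the next layer simply averages the units, and the target is 1.0, so the loss without dropout is exactly 0. The function name `dropout_mse` and all parameters are made up for this example.

```python
import random

def dropout_mse(n_units, p_drop=0.5, n_trials=20000, seed=0):
    """Average squared error of a toy layer's output under inverted dropout.

    Toy assumptions (not the original network): every unit outputs 1.0,
    the next layer averages the units, and the target is 1.0.
    Without dropout the loss is therefore exactly 0.
    """
    rng = random.Random(seed)
    keep = 1.0 - p_drop
    total = 0.0
    for _ in range(n_trials):
        # Inverted dropout: keep each unit with probability `keep`
        # and scale the survivors by 1 / keep.
        kept = sum(1 for _ in range(n_units) if rng.random() < keep)
        output = kept / keep / n_units
        total += (output - 1.0) ** 2
    return total / n_trials

# Compare the extra loss caused purely by dropout noise for two widths.
for n in (7, 20):
    print(n, round(dropout_mse(n), 3))
```

In this toy case the extra loss caused by dropout is p / ((1 - p) * n) in expectation, i.e. about 1/7 ≈ 0.14 for 7 units and 1/20 = 0.05 for 20 units at p = 0.5. A real network is not this simple, but the trend is the same: the fewer units there are, the noisier the dropped-out output and the more the training-time loss is inflated.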

So be careful using dropout on small layers: the train loss reported during training may overestimate the actual train loss.


I will share my thoughts, since I do not fully understand your conclusion.
You claim that the network has an overfitting problem. Overfitting is when you have a good fit (low error) in training and high error in test and validation (https://en.wikipedia.org/wiki/Overfitting). But you have about the same error in both training and validation (slightly higher error in training).
When you add more nodes or reduce the dropout, you let more information pass through the network and the error becomes smaller.
I would classify this as an underfitting problem.

Yes, overfitting is low training error and high validation error. And yes, in my case during training I have train error 1.62 and validation error 1.58, so it looks like the network is underfitted.

The point of this topic is that, because of the dropout layers, the train error reported during training is misleading. Train error is calculated with dropout enabled. Validation error is calculated with dropout disabled. When I calculate the train error with dropout disabled, it equals 0.92, well below the validation error of 1.58, which means the network is overfitted. And this overfitting isn’t visible during training. You have to manually calculate the train loss without dropout to find it out.
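For anyone who wants to check this on their own model, here is a minimal sketch in plain PyTorch (not fastai-specific). The model, the random data, and the helper names `loss_with_dropout` and `loss_without_dropout` are all made up for illustration; the only assumption is that your model is an ordinary `nn.Module`, so `model.train()` and `model.eval()` switch dropout on and off.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the network in this thread:
# a 7-unit layer followed by 0.5 dropout.
model = nn.Sequential(
    nn.Linear(10, 7),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(7, 1),
)

# Made-up data, only here to make the sketch runnable.
x_train = torch.randn(256, 10)
y_train = torch.randn(256, 1)
loss_fn = nn.MSELoss()

def loss_with_dropout(model, x, y, n_passes=50):
    """Loss as seen during training: dropout active, averaged over random masks."""
    model.train()  # enables dropout
    with torch.no_grad():
        return sum(loss_fn(model(x), y).item() for _ in range(n_passes)) / n_passes

def loss_without_dropout(model, x, y):
    """'Actual' train loss: same training data, dropout disabled as in validation."""
    model.eval()  # disables dropout
    with torch.no_grad():
        return loss_fn(model(x), y).item()

print("train loss, dropout on: ", loss_with_dropout(model, x_train, y_train))
print("train loss, dropout off:", loss_without_dropout(model, x_train, y_train))
```

The number to compare against the validation loss is the second one, since both are then computed with dropout disabled. Remember to call `model.train()` again before continuing training.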

You wouldn’t want to enable drop-out in validation. The network isn’t learning anything at that point, so you want to use everything the network has learned to determine validation loss. Same as if you were using a model for predictions - you wouldn’t include drop-out then.

FWIW, I could be misunderstanding your point here - I’m no expert by far.

My work process is as follows. Start with a very simple model without any regularization at all, just to make sure the network is learning. Then add different types of regularization and change parameters, optimizer, initialization, etc. to see how to get the most robust result.
If this were my case, I would probably have seen a huge difference between training and validation without any regularization. Then I would add dropout and maybe lower the difference a bit and get a smaller error in validation. If I, for example, add weight decay, I can’t compare the training error because of the nature of weight decay. That’s why I have another metric, like accuracy or something else that is comparable.
Over- and underfitting for me is just an indication of the network’s behavior that helps me figure out what to do next. My main goal is to improve the error/accuracy on the validation set. So this is a good conclusion from your point of view, but it does not give me enough information to change my work process.

It would be interesting to know how this conclusion affects your work process.

Sure, dropout isn’t needed in either validation or inference. But in some cases (mine included) it may be useful to calculate the train loss without dropout to understand the actual difference between train and test losses.

Thank you for the great advice. I think your approach makes great sense.
I’m not very experienced in ML yet, so I started with a different approach. I started with a complex model without regularization, then simplified the model gradually and added dropout at some point. In my case, when I added dropout it looked like magic: train error became higher and validation error became lower. Then I was confused by the fact that the train error changed a lot when I changed the number of neurons in the “bottleneck” layer.