Can anyone explain the significance?
In the code below, the line after train()/eval() computes the per-batch losses over the training set; the next line computes their mean. When the model is in eval mode, the mean loss is about 5x greater. modelR contains many nn.BatchNorm3d layers.
This divergence between the train- and eval-mode losses appeared as training progressed and kept growing. For most of training, the eval loss had actually been smaller than the training loss.
What I am hoping for is an intuitive explanation of what is going on. Thanks!!!
modelR.train()
trainedLosses = [lossfn(mtarget.cuda(),modelR(mbatch.cuda())).item() for (mbatch,mtarget) in training_generator_mem()]
mean(trainedLosses)
0.0002706261747435848
modelR.eval()
trainedLosses = [lossfn(mtarget.cuda(),modelR(mbatch.cuda())).item() for (mbatch,mtarget) in training_generator_mem()]
mean(trainedLosses)
0.001313385098102218
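For context, the relevant mechanism is that BatchNorm normalizes each batch with that batch's own mean and variance in train() mode, while updating running averages of those statistics; in eval() mode it normalizes with the running averages instead. If the running estimates lag behind the statistics the network actually sees during training, the eval-mode activations (and hence the loss) shift. A toy single-feature sketch of this behavior (plain Python, illustrative only, not modelR's code):

```python
import math

def batchnorm(x, state, training, momentum=0.1, eps=1e-5):
    """One-feature batch norm; `state` holds the running statistics."""
    if training:
        # Train mode: normalize with the batch's own statistics
        # and nudge the running estimates toward them.
        m = sum(x) / len(x)
        v = sum((xi - m) ** 2 for xi in x) / len(x)
        state["mean"] += momentum * (m - state["mean"])
        state["var"] += momentum * (v - state["var"])
    else:
        # Eval mode: normalize with the running estimates instead.
        m, v = state["mean"], state["var"]
    return [(xi - m) / math.sqrt(v + eps) for xi in x]

state = {"mean": 0.0, "var": 1.0}   # freshly initialized running stats
batch = [4.0, 5.0, 6.0, 5.0]        # activations with mean ~5

train_out = batchnorm(batch, state, training=True)
eval_out = batchnorm(batch, state, training=False)
```

Here `train_out` is centered by the batch's own mean, but `eval_out` is shifted because the running mean (updated once with momentum 0.1) still sits near zero. The same lag, compounded across many stacked BatchNorm3d layers, is one way a train/eval loss gap like yours can open up and grow.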