model.train() gives a much lower loss than model.eval()

After sleeping on it, I think the issue is possibly related to using bs=1. With only one sample per batch, BatchNorm sees a per-batch variance of (near) 0, so activations end up divided by the square root of epsilon plus a variance of ~0, which could blow them up. What happens after that, I can only guess.
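
In case it helps anyone reproduce this, here's a minimal sketch in plain PyTorch (the layer size, momentum, and data are all made up) of what I *suspect* is happening: each bs=1 batch feeds BatchNorm a near-zero variance, which drags its running_var toward 0, so eval-mode activations get divided by roughly sqrt(eps) and blow up:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(1, momentum=0.1)

for _ in range(100):
    bn.train()
    # each "batch" is a single sample with an almost-constant feature map
    x = torch.full((1, 1, 4, 4), 5.0) + 1e-4 * torch.randn(1, 1, 4, 4)
    bn(x)  # updates running_mean/running_var from this one sample

print(bn.running_var)  # dragged toward ~0 by the near-zero batch variances

bn.eval()
y = bn(torch.randn(1, 1, 4, 4))  # eval mode normalizes with running stats
print(y.abs().max())  # huge: divided by sqrt(running_var + eps)
```

On my understanding, train mode gets away with it because each batch is normalized by its own statistics, while eval mode inherits the corrupted running stats, which would explain the train/eval loss gap in the title.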

I moved this question to a new topic.
https://forums.fast.ai/t/how-to-deal-with-batchnorm-and-batch-size-of-1/83241

Any insights are greatly appreciated!