Why does this weird batch normalization work?

Hi guys,

I have an RNN for sequential tabular data. I had to write my own batch normalization because of the way I built the model, and something funky happened.

Instead of calculating the moving mean/variance over the training set and applying those statistics to the test set, my code had a bug: it calculated the mean/variance of only the first batch, and used that to normalize all data during training and validation.

What happened is that when I realized the mistake and corrected it, my results got much worse. Have you ever seen a paper about normalizing every layer's output based on the statistics of a single batch? It reminded me of Layer-sequential unit-variance (LSUV) initialization, where you do something similar, but to set the right initial weights for each layer.
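For anyone curious, the "accidental" scheme described above can be sketched roughly like this: statistics are frozen from the very first batch and reused for every later batch. This is a minimal hypothetical sketch in NumPy (class name, shapes, and eps are made up for illustration), not the actual code from the post:

```python
import numpy as np

class FirstBatchNorm:
    """Normalize using mean/std frozen from the first batch seen.

    Hypothetical sketch of the accidental scheme: instead of tracking a
    running mean/variance, the stats of the first batch are cached and
    applied to all subsequent batches (train and validation alike).
    """
    def __init__(self, eps=1e-5):
        self.mean = None
        self.std = None
        self.eps = eps

    def __call__(self, x):
        if self.mean is None:
            # Freeze per-feature statistics from the very first batch...
            self.mean = x.mean(axis=0)
            self.std = x.std(axis=0)
        # ...and reuse them unchanged for every later batch.
        return (x - self.mean) / (self.std + self.eps)

rng = np.random.default_rng(0)
norm = FirstBatchNorm()
first = norm(rng.normal(5.0, 2.0, size=(64, 3)))  # sets the frozen stats
later = norm(rng.normal(5.0, 2.0, size=(64, 3)))  # reuses them as-is
```

Because the stats never update, every batch is shifted and scaled by the same fixed constants, which keeps train and validation normalization perfectly consistent (unlike standard batch norm, which uses per-batch stats at train time).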


This is actually how the `Normalize()` transform works in fastai2 now. If you don't pass any statistics, it calculates them off of the first batch :slight_smile:


Interesting! Thanks for the answer. Do you know of any paper that discusses this? It's the first time I've seen it.

They empirically found it did a good enough job, IIRC. The other thing to consider is how different the standard deviation of the full dataset was from that of your first-batch subset.
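That comparison is easy to check directly: compute per-feature std on the whole training set and on the first batch, and look at the relative gap. A quick sketch with made-up synthetic data (sizes and distributions are illustrative assumptions, not from the thread):

```python
import numpy as np

# Rough check of how well a single batch's statistics match the full set.
rng = np.random.default_rng(42)
data = rng.normal(loc=3.0, scale=4.0, size=(10_000, 5))  # stand-in "dataset"
batch = data[:64]                                         # the first batch

full_std = data.std(axis=0)
batch_std = batch.std(axis=0)

# Relative error of the batch estimate, per feature. For i.i.d. data and a
# batch of 64 this is typically on the order of 10%.
rel_err = np.abs(batch_std - full_std) / full_std
print(rel_err)
```

If the relative error is small, a frozen first-batch normalization behaves almost like normalizing with full-dataset statistics; if the first batch is unrepresentative (e.g. sorted or time-ordered data), the gap can be much larger.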