I have an RNN for sequential tabular data. I had to write my own batch normalization because of the way I built the model, and something funky happened.
Instead of computing a moving average/variance over the train set and applying that at test time, my code had a bug: I computed the mean/variance of only the first batch and used those fixed statistics to normalize all data during training and validation.
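Roughly, the buggy version was effectively doing something like this (a minimal PyTorch-style sketch, not my actual code; the module and names here are made up for illustration):

```python
import torch
import torch.nn as nn

# Sketch of the accidental "first-batch normalization".
# Instead of tracking running statistics, the stats from the very first
# batch are frozen and reused to normalize every later batch, in both
# training and validation.
class FirstBatchNorm(nn.Module):
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.register_buffer("mean", torch.zeros(num_features))
        self.register_buffer("var", torch.ones(num_features))
        self.register_buffer("initialized", torch.tensor(False))
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))

    def forward(self, x):  # x: (batch, num_features)
        if not self.initialized:
            # The bug: statistics are captured once, from the first batch
            # only, and never updated afterwards.
            self.mean.copy_(x.mean(dim=0))
            self.var.copy_(x.var(dim=0, unbiased=False))
            self.initialized.fill_(True)
        x_hat = (x - self.mean) / torch.sqrt(self.var + self.eps)
        return self.gamma * x_hat + self.beta
```

So every batch, train or validation, got normalized with the exact same frozen first-batch statistics.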
When I realized the mistake and corrected it, my results got way worse. Have you guys ever seen a paper about normalizing every layer's output using the statistics of a single batch? It reminded me of layer-sequential unit-variance (LSUV) initialization, where you do something similar, but to set the right initial weights for each layer.