Why Use BatchNorm for Continuous Data When It Is Already Normalized?

When we deal with tabular data, we use a TabularList and a TabularLearner, along with a number of pre-processors such as FillMissing, Categorify and Normalize. Now, after creating a tabular learner like this (taken from Lesson 6):-

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04, 
                        y_range=y_range, metrics=exp_rmspe)

and then printing out the model, we get this:-

(embeds): ModuleList(
(0): Embedding(1116, 81)
(1): Embedding(8, 5)
(2): Embedding(4, 3)
(3): Embedding(13, 7)
(4): Embedding(32, 11)
(5): Embedding(3, 3)
(6): Embedding(26, 10)
(7): Embedding(27, 10)
(8): Embedding(5, 4)
(9): Embedding(4, 3)
(10): Embedding(4, 3)
(11): Embedding(24, 9)
(12): Embedding(9, 5)
(13): Embedding(13, 7)
(14): Embedding(53, 15)
(15): Embedding(22, 9)
(16): Embedding(7, 5)
(17): Embedding(7, 5)
(18): Embedding(4, 3)
(19): Embedding(4, 3)
(20): Embedding(9, 5)
(21): Embedding(9, 5)
(22): Embedding(3, 3)
(23): Embedding(3, 3)
(emb_drop): Dropout(p=0.04)
(bn_cont): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(layers): Sequential(
(0): Linear(in_features=233, out_features=1000, bias=True)
(1): ReLU(inplace)
(2): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): Dropout(p=0.001)
(4): Linear(in_features=1000, out_features=500, bias=True)
(5): ReLU(inplace)
(6): BatchNorm1d(500, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): Dropout(p=0.01)
(8): Linear(in_features=500, out_features=1, bias=True)

As seen, embeddings are used for the categorical variables, while the continuous variables are first passed through a BatchNorm layer (bn_cont). If I understand correctly, Batch Normalization normalizes the activations: within each mini-batch it standardizes them to zero mean and unit standard deviation, and then applies a learned scale (gamma) and a learned shift (beta), so that the activations in a layer keep a similar mean and standard deviation across multiple mini-batches.
Since we have already normalized the continuous variables using the Normalize pre-processor, why do we need to pass them through a BatchNorm layer and do a similar thing again?
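
To make the question concrete, here is a minimal sketch in plain Python (no fastai or PyTorch) of the two steps being compared. It assumes a made-up continuous column and a hand-written `batch_norm` helper; both are illustrative names, not fastai's implementation. It suggests two differences: a mini-batch of already-normalized data is only approximately zero-mean and unit-std, and the BatchNorm output can be rescaled and shifted by the learned gamma and beta.

```python
import random
import statistics

random.seed(0)

# Hypothetical continuous column (values are made up, not from the thread).
raw = [random.gauss(5000.0, 2000.0) for _ in range(10_000)]

# What a Normalize-style pre-processor does: one mean/std for the whole training set.
mean, std = statistics.fmean(raw), statistics.pstdev(raw)
normalized = [(x - mean) / std for x in raw]

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch norm for a single feature: standardize with the
    batch's own statistics, then scale by gamma and shift by beta."""
    mu = statistics.fmean(batch)
    var = statistics.pvariance(batch)
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in batch]

# A mini-batch of 64 rows is only approximately zero-mean / unit-std.
batch = normalized[:64]
print("batch mean/std before BN:", statistics.fmean(batch), statistics.pstdev(batch))

# With gamma=1, beta=0 the batch is re-standardized using its own statistics.
out = batch_norm(batch)
print("batch mean/std after BN: ", statistics.fmean(out), statistics.pstdev(out))

# gamma and beta are learned in a real layer, so the output need not stay at 0 / 1.
shifted = batch_norm(batch, gamma=2.0, beta=0.5)
print("with gamma=2, beta=0.5:  ", statistics.fmean(shifted), statistics.pstdev(shifted))
```

In a real BatchNorm1d layer gamma and beta are trained per feature and running statistics are used at inference time; the sketch leaves both out to keep the arithmetic visible.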

Because the further down the model you go, the further the activations drift from the normalized distribution. Remember that we normalize vision data too, either with the dataset's own mean and std or with the ImageNet statistics, and those models still contain BatchNorm layers.
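
The drift is easy to see in a small experiment. The sketch below is plain Python with made-up random weights for a single hidden unit (the 16 inputs mirror the width of bn_cont; everything else is an assumption for illustration). It feeds normalized inputs through one linear unit followed by ReLU and prints the statistics of the result.

```python
import random
import statistics

random.seed(0)

N_ROWS, N_IN = 2000, 16  # 16 matches bn_cont's width; the row count is arbitrary

# Inputs that are already normalized: every feature is drawn from N(0, 1).
rows = [[random.gauss(0.0, 1.0) for _ in range(N_IN)] for _ in range(N_ROWS)]

# One hidden unit with arbitrary (hypothetical) weights, followed by ReLU.
weights = [random.gauss(0.0, 1.0) for _ in range(N_IN)]
bias = 0.0

def hidden_unit(row):
    pre_activation = sum(w * x for w, x in zip(weights, row)) + bias
    return max(0.0, pre_activation)  # ReLU

first_feature = [r[0] for r in rows]
acts = [hidden_unit(r) for r in rows]

print("input  mean/std:", statistics.fmean(first_feature), statistics.pstdev(first_feature))
print("hidden mean/std:", statistics.fmean(acts), statistics.pstdev(acts))
```

The input feature stays close to mean 0 and std 1, while the hidden activation is non-negative with a clearly positive mean, so it is no longer normalized after just one layer.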

So it is used to keep the activations in a controlled range, with a stable mean and standard deviation, which in turn allows the model to train faster.
Thank you so much.