Batch Normalization

Can someone explain how batch norm is implemented in a neural network? I’ve read in several sources about its benefits for the overall performance of the network, but none has specifically explained the mechanism. Also, if we have already normalized our training dataset, is batch norm even necessary? It seems like the exploding/vanishing gradient problem is already solved by normalizing the training set during preprocessing.

A little backstory: I’ve been experimenting with the tabular learner for about 6 months now, training on financial data. So far, judging by my network’s performance, it is still a toss-up whether to use batch norm or not. This is why I need to know how it is actually implemented in tabular, in order to decide whether we need batch norm in the first place.

Thank you in advance!

The formula for batch norm (BN) is:

output = gamma * (input - mean) / sqrt(variance + epsilon) + beta

Here, mean and variance are computed over the batch. Epsilon is a small number to make sure the denominator doesn’t become 0. Gamma and beta are learned parameters (just like neural network weights).

The BN layer also keeps track of a moving mean and variance, which are used at inference time.
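To make the formula and the moving statistics concrete, here is a minimal NumPy sketch of a batch norm layer for 2-D inputs of shape (batch, features). The class name `BatchNormSketch`, the `momentum` value, and the use of the biased batch variance for the moving average are assumptions made for illustration; real libraries differ in these details, and in a real network gamma and beta would be updated by the optimizer.

```python
import numpy as np

class BatchNormSketch:
    """Illustrative batch norm over the batch axis (hypothetical class, not a library API)."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(num_features)   # learned scale (left fixed in this sketch)
        self.beta = np.zeros(num_features)   # learned shift (left fixed in this sketch)
        self.eps = eps
        self.momentum = momentum             # assumed update rate for the moving statistics
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def __call__(self, x, training=True):
        if training:
            # Statistics computed over the batch, one value per feature.
            mean = x.mean(axis=0)
            var = x.var(axis=0)
            # Moving averages, kept for use at inference time.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        # output = gamma * (input - mean) / sqrt(variance + epsilon) + beta
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta


rng = np.random.default_rng(0)
bn = BatchNormSketch(num_features=3)
x = rng.normal(loc=5.0, scale=2.0, size=(64, 3))

y_train = bn(x, training=True)    # normalized with this batch's statistics
y_eval = bn(x, training=False)    # normalized with the moving statistics
print(y_train.mean(axis=0), y_train.std(axis=0))
```

In training mode each feature of the output has roughly zero mean and unit standard deviation within the batch. In inference mode the output depends only on the stored moving statistics, so a single sample gets the same result regardless of what else is in the batch.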

That’s all there is to it, really.


First of all, thank you so much for writing out the actual formula!

So in this case, is it even necessary to use batch norm if we have already normalized our input data?

In part 2 of course-v3, Jeremy covers this thoroughly, both the implementation and the intuition behind it.


BatchNorm is useful before the later layers too :)

When you normalize (or actually standardize) your input, that’s great, but in later layers it’s a good idea to standardize again and again, because the nonlinearities and biases change the distribution of the activations as training goes on.
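A quick way to see this is to push standardized inputs through a single linear layer plus ReLU and look at the statistics afterwards. This is only a sketch with assumed values: the layer size, the random weights `W`, and the biases `b` are made up for illustration and are not taken from any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Standardized inputs: roughly zero mean and unit standard deviation.
x = rng.standard_normal((1024, 16))

# One hypothetical hidden layer with random weights and biases, followed by ReLU.
W = rng.standard_normal((16, 16))
b = rng.standard_normal(16)
h = np.maximum(x @ W + b, 0.0)

print("input : mean", x.mean(), "std", x.std())
print("hidden: mean", h.mean(), "std", h.std())
```

The hidden activations are no longer centered at zero (ReLU makes them non-negative) and their scale depends on the weights, so the next layer does not see standardized data even though the input was.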

Another thing: BN also has some regularization effect on training, so you can get better results. Exactly how that works is still an area of research, though.
I’ve also seen papers claim that it smooths the loss function, which makes it easier to optimize.

Whatever the real reason is, it seems to help, so people stick with it.
