Batch Normalization question from Chapter 13

Hello,

The following text appears in Chapter 13:

“Batch normalization works by taking an average of the mean and standard deviations of the activations of a layer and using those to normalize the activations. However, this can cause problems because the network might want some activations to be really high in order to make accurate predictions. So we also added two learnable parameters (meaning they will be updated in the SGD step), usually called gamma and beta. After normalizing the activations to get some new activation vector y, a batchnorm layer returns gamma*y + beta.”

Can you please help me understand how adding two more learnable parameters (gamma and beta) solves the problem of normalizing away the peaks, a.k.a. the “high activations needed to make accurate predictions”?

Thanks in advance!


Without gamma and beta, the per-channel mean and standard deviation would always be forced to zero and one. This limits what the next layer can learn from the incoming activations; for example, it would not be able to discern that one channel is larger than another on average. Beta and gamma are learned, and they become the new mean and standard deviation for each channel, which can differ from those of the other channels.

That’s a paraphrase of the “official” explanation. There is still lots of debate about why batchnorm really works.
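
To make the per-channel behavior concrete, here’s a minimal sketch of the computation, assuming a (batch, channels, height, width) input. The names `gamma` and `beta` follow the book’s notation, and I’m ignoring the small eps term and the running statistics that a real batchnorm layer also keeps:

```python
import torch

# Fake activations: 8 images, 3 channels, 4x4, with mean ~10 and sd ~5
x = torch.randn(8, 3, 4, 4) * 5 + 10

# Per-channel statistics: reduce over batch and spatial dims, keep channel dim
mean = x.mean(dim=(0, 2, 3), keepdim=True)   # shape (1, 3, 1, 1)
std = x.std(dim=(0, 2, 3), keepdim=True)     # shape (1, 3, 1, 1)

# After this, every channel has mean ~0 and sd ~1
y = (x - mean) / std

# gamma and beta: one learnable value per channel, so each channel can
# end up with whatever mean and sd the network finds useful
gamma = torch.ones(1, 3, 1, 1, requires_grad=True)
beta = torch.zeros(1, 3, 1, 1, requires_grad=True)
out = gamma * y + beta
```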

:slightly_smiling_face:


Oh wow! I vaguely understood that beta and gamma really just form another linear transformation in order to learn back the large activations that were normalized away by batchnorm, but your explanation clarifies it further.

I watched the video corresponding to Chapter 13 and read the entire chapter too, but never caught on to normalization being “per channel”. Are you sure about this? The reason I ask is the following line from Chapter 13:

“Each mini-batch will have a somewhat different mean and standard deviation than other mini-batches.”

The aggregation of activations to be normalized seems to happen at the batch level (and not at the channel level)?

I liked your quotes around “official” :wink: Yeah, I read there is no consensus on why batchnorm works. Haha, reminds me of all those times the code worked when it really shouldn’t have :stuck_out_tongue:

Hi fastuser,

There’s a separate running mean, running variance, beta, and gamma for each channel of the input images. The first two are calculated per batch, and the latter two are learned via gradients. You can also see this for yourself in the PyTorch docs or source code.
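
For example (a quick sketch; these are the actual `nn.BatchNorm2d` attribute names, where `weight` plays the role of gamma and `bias` the role of beta):

```python
import torch.nn as nn

bn = nn.BatchNorm2d(3)        # expects 3-channel input
print(bn.running_mean.shape)  # torch.Size([3]) -- one per channel, updated each batch
print(bn.running_var.shape)   # torch.Size([3])
print(bn.weight.shape)        # torch.Size([3]) -- gamma, learned via gradients
print(bn.bias.shape)          # torch.Size([3]) -- beta, learned via gradients
```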

I did not understand any of these details until just a few weeks ago. I was forced to learn when someone else’s pretrained model with BatchNorm3d’s caused some problems. There’s really no substitute for actual practice, so I suggest creating an nn.BatchNorm2d and a batch of fake images of random numbers and experimenting, as in the sketch below. You can see the internal variables, how they are used, and how everything fits together. Before you know it, you will be an expert on BatchNorm and be answering forum questions.
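
Something along these lines, for instance (a sketch that recomputes the training-mode output by hand; the 8x3x4x4 batch shape and the mean/sd of the fake data are arbitrary choices):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)
x = torch.randn(8, 3, 4, 4) * 5 + 10  # batch of fake "images"

bn.train()                             # training mode: use this batch's stats
out = bn(x)

# Recompute by hand: per-channel mean and (biased) variance over batch
# and spatial dims, then scale by weight (gamma) and shift by bias (beta)
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / (var + bn.eps).sqrt()
manual = bn.weight.view(1, -1, 1, 1) * manual + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(out, manual, atol=1e-5))  # True
print(bn.running_mean)  # nudged toward this batch's per-channel means
```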

:slightly_smiling_face:

Agreed 100%! Thank you once again, Malcolm!