The following text appears in Chapter 13:
“Batch normalization works by taking an average of the mean and standard deviations of the activations of a layer and using those to normalize the activations. However, this can cause problems because the network might want some activations to be really high in order to make accurate predictions. So we also added two learnable parameters (meaning they will be updated in the SGD step), usually called gamma and beta. After normalizing the activations to get some new activation vector y, a batchnorm layer returns gamma*y + beta.”
Can you please help me understand how adding two more learnable parameters (gamma and beta) solves the problem caused by normalizing the peaks, i.e. the "high activations needed to make accurate predictions"?
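To make my question concrete, here is a minimal NumPy sketch of what I understand a batchnorm layer to compute (my own illustration, not code from the book; the function name `batchnorm` and the example values are mine). The part I'm asking about is the last line, `gamma * y + beta`:

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # Normalize each feature to zero mean and unit variance over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    y = (x - mean) / np.sqrt(var + eps)
    # gamma and beta are learnable: SGD can use them to restore any scale
    # and shift, so normalization doesn't permanently cap activations
    return gamma * y + beta

# Toy batch of activations: the first feature has large values
x = np.array([[10.0, 0.1],
              [12.0, 0.2],
              [14.0, 0.3]])

# With gamma=1, beta=0 the output is just the normalized activations
out = batchnorm(x, gamma=np.ones(2), beta=np.zeros(2))

# With a large learned gamma/beta, the first feature becomes "high" again
out_scaled = batchnorm(x, gamma=np.array([5.0, 1.0]),
                          beta=np.array([12.0, 0.0]))
```

So my reading is that if the network needs a feature to stay large, SGD can just learn a large gamma or beta for it. Is that the right intuition?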
Thanks in advance!