Wiki: Lesson 7

(nok) #21

Can someone explain why SGD will undo the normalization, but BatchNorm still works?

(nok) #22

@jeremy Could you elaborate on that part a bit? After reading some articles, I get the intuition for why adding extra parameters could help in BatchNorm, but I still can't understand what you mean when you say:

  1. SGD will undo it
  2. Why adding scaling parameters addresses this “undo” issue

(Emil) #23

You can try reading the original BatchNorm paper; it’s quite accessible. In Section 2, they give a little example of a layer that adds a learned bias and then centers the result (that is, subtracts the mean). It turns out that if you write the expression for the layer output after the gradient update, the bias update term cancels out. So even when the optimization procedure changes the bias parameter, the update changes neither the layer output nor the loss.
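Sketching the algebra (in the paper's notation, with layer input $u$ and learned bias $b$; this is a paraphrase, not a quote), the layer computes

$$\hat{x} = x - \mathrm{E}[x], \qquad x = u + b .$$

If the gradient step ignores the dependence of $\mathrm{E}[x]$ on $b$ and updates $b \leftarrow b + \Delta b$ with $\Delta b \propto -\partial \ell / \partial \hat{x}$, the new output is

$$u + (b + \Delta b) - \mathrm{E}\big[u + (b + \Delta b)\big] = u + b - \mathrm{E}[u + b] = \hat{x},$$

so the update to $b$ changes nothing downstream, and $b$ just keeps growing while the loss stays fixed.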

It is not shown explicitly, but they claim that the same thing happens if you scale the input.
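You can also check it numerically. Here is a minimal sketch (plain NumPy with made-up numbers, not code from the paper or the lesson notebooks): a centering layer ignores changes to an additive bias, and a standardizing layer ignores a positive input scale.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0])      # hypothetical layer inputs
b = 0.5                                  # learned bias

def centered(u, b):
    """Add a bias, then subtract the mean (centering only)."""
    x = u + b
    return x - x.mean()

def standardized(u, scale):
    """Scale the input, then subtract the mean and divide by the std."""
    x = scale * u
    return (x - x.mean()) / x.std()

# The bias update cancels out: centering removes whatever b we add.
print(np.allclose(centered(u, b), centered(u, b + 10.0)))        # True

# Likewise, a (positive) input scale is removed by full standardization.
print(np.allclose(standardized(u, 1.0), standardized(u, 3.0)))   # True
```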

Here is a relevant excerpt from the paper (sorry for posting it as an image, but I can’t find how to typeset math in Discourse):


(nok) #24

Thx, I actually just printed the paper out, will have a look soon.