# Wiki: Lesson 7

**nok** (nok) #21

Can someone explain the part at https://youtu.be/H3g26EVADgY?t=5340?

Why does SGD undo the normalization, while BatchNorm still works?

**nok** (nok) #22

@jeremy Could you elaborate on that part a bit? After reading some articles, I get the intuition of why adding extra parameters could help in BatchNorm. But I still couldn't get what you mean when you say

- SGD will undo it
- why adding scaling parameters addresses this "undo" issue.

**emilmelnikov**(Emil) #23

You can try to read the original BatchNorm paper; it's quite accessible. In section 2, they give a little example of a layer that adds a learned bias, and then centers the result (that is, subtracts the mean). It turns out that if you write the expression for the layer output *after* the gradient update, the bias update term cancels out. So, even when the optimization procedure changes the bias parameter, the update changes neither the layer output nor the loss.

It is not shown explicitly, but they claim that the same thing happens if you scale the input.
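A quick way to see the cancellation numerically (a minimal sketch of the paper's Section 2 example, with a made-up input vector `u` for illustration): if the layer computes `x = u + b` and then subtracts the batch mean, the bias drops out entirely, so any SGD update to `b` leaves the output untouched.

```python
import numpy as np

# Toy version of the paper's example: x = u + b, then center x.
# The centered output is identical for every value of b, so an
# SGD update to b cannot change the output or the loss.

rng = np.random.default_rng(0)
u = rng.normal(size=8)  # hypothetical activations from the previous layer

def centered_output(b):
    x = u + b           # layer adds a learned bias
    return x - x.mean() # normalization subtracts the batch mean

out_before = centered_output(0.0)
out_after = centered_output(100.0)  # pretend SGD moved b a lot

# The bias update "cancels out": both outputs are the same.
assert np.allclose(out_before, out_after)
print("max difference:", np.abs(out_before - out_after).max())
```

Since the loss never changes, the gradient keeps pushing `b` in the same direction and it grows without bound, which is the "SGD undoes the normalization" problem; BatchNorm's learned scale and shift are applied *after* normalization, so they are not cancelled this way.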

Here is a relevant excerpt from the paper (sorry for posting it as an image, but I can't find how to typeset math in Discourse):