Can someone explain it at https://youtu.be/H3g26EVADgY?t=5340

Why SGD will undo the normalization while BatchNorm works?

# Wiki: Lesson 7

**nok**(nok) #21

**nok**(nok) #22

@jeremy Could you elaborate that part a bit. After reading some articles, I get the intuition that why adding extra parameters could help in BatchNorm. But I still couldnât get what do you mean when you are saying

- SGD will undo it
- Why adding scaling parameters address this âundoâ issue.

**emilmelnikov**(Emil) #23

You can try to read the original BatchNorm paper, itâs quite accessible. In section 2, they give a little example of a layer that adds a learned bias, and then centers the result (that is, subtracts the mean). It turns out that if you write the expression for the layer output *after* the gradient update, the bias update term cancels out. So, even when the optimization procedure changes the bias parameter, the update doesnât change neither the layer output nor the loss.

It is not shown explicitly, but they claim that the same thing happens if you scale the input.

Here is a relevant excerpt from the paper (sorry for posting it as an image, but I canât find how to typeset math in Discourse):

**kro**(karenerobinson) #25

Note that you need to use accuracy_np(preds,y) rather than accuracy(preds,y)

**Kjeanclaude**(Kouassi Konan Jean-Claude) #26

Unless something change, during the January 2018 session I runned it with **accuracy()** function and all worked perfectly.

Even if * accuracy_np()* function works too, I think the sole difference is that

*and*

**accuracy() is the torch version***of accuracy; please look at the function here :*

**accuracy_np() is a numpy version**Given the things, as the fastai library continuously evolves (* new improvements*), there is also a

*I will try later, if I get a bit of free time.*

**new cpu only version****Brad_S**#27

Hereâs a go from my level of understandingâŚ Lets say SGD wants to push an activation function up, i.e. increase the mean (but here maintain the shape / variance). BN would reduce it, SGD would push up - but thereâs no easy way for it to do so, so it would try changing lots of weights. BN would reduce it. = fail.

Now - all the previous weights stay the same, SGD changes the 1 scaling procedure and up it goes SGD happy, BN happy.

you can do a similar thought experiment for the m multiplier

**username_not_found**(unknown) #28

How can I draw the CAM plot for the Dog class? Iâm confused where the Cat class is specified in https://github.com/fastai/fastai/blob/master/courses/dl1/lesson7-CAM.ipynb

**Dooley**(Guillaume) #29

Hello,

I have a question regarding the stateful RNNs, and the behaviour of the hidden state during training and inference.

**Training phase**

Within a given chunk of text (one of the 64 pieces of text if batch size = 64 for example), assuming bptt = 8 as in the notebook example.

- Batch 0. We take Characters 0 to 7 to predict characters 1 to 8. The hidden state at the end of batch 0 corresponds to the hidden state value to predict character 8 (result of the RNN time iteration at character 7).
- Batch 1. We take Characters 8 to 15 to predict characters 9 to 16. For the first time iteration, we have the hidden state corresponding to the previous prediction at step 7, which contains the history of steps 0 to 6. Therefore this allows making stateful predictions with a hidden state taking the old values into account and the new and old positions match for the hidden state.

**Inference phase**

Reading at the code at the end of the notebook, the last cell generates a 400 characters prediction. The text is built character by character as follows:

- Use characters 0 to 7 to predict character 8. Hidden state corresponds to character 7.
- Use characters 1 to 8 to predict character 9. For the first character we take a hidden state ârepresentingâ the history at character 7 when we start the loop and make predictions for character 2.
- Use characters 2 to 9 to predict character 10. For the first character we take a hidden state ârepresentingâ the history at character 8 to make predictions for character 3.

Basically I feel like the hidden state we should use as we loop through is not the hidden state at the end of a BPTT sequence, but the hidden state at the second iteration of the RNN loop.

Did I miss something?

**davidahern**(David Ahern) #31

hi, just started to watch the videos

when talking about the same thing, itâs described as

48:50 - neural net one hidden layer

49:25 - neural net no hidden layer

thanks

**hoon**#33

Iâm a little confused about this line regarding stateful RNN:

`m = CharSeqStatefulRnn(md.nt, n_fac, 512).cuda()`

We pass in 512 as batch size, but it should be 64, right? (since we set `bs=64`

and pass it in as a param for `LanguageModelData.from_text_files`

). I logged out the size of `self.h`

and after the first iteration the dimension corrects to `[1, 64, 256]`

, since we check if `self.h.size(1) != bs`

, but I was a bit confused where 512 came from.

**shubham3121**(Shubham Gupta) #34

Have written a blog on Generating your own music using RNNs. Hope you enjoy it.

**dom**#35

Has anyone else tried to get the CAM things at the end working on specific input images. Can get an prediction out for test image (using learn.predict_array) but struggling to work out how to get heatmap to see different parts of images for individual test image.

**shivamsaboo17**(Shivam) #36

@jeremy I have a doubt regarding resnet.

The resnet block simply does the operation: x + f(x) where x is the input and f(x) is output of BnLayer. But the dimensions of input and output wonât be same so how can these two be added?

**abhishekmishra**(Abhishek Mishra) #38

Below line will make sure that batch size is set according to batch passed from training_data_loader.

if self.h.size(1) != bs: self.init_hidden(bs)

**abhishekmishra**(Abhishek Mishra) #39

Hi Shivam,

I understand you are referring overridden âforwardâ method of âResnetLayerâ class.

Please note that we are passing stride=1 while instantiating âResnetLayerâ class,hence it will create conv2d object with 1 stride during class instantiation.(check *init* of âBnLayerâ class)

*self.layers2 = nn.ModuleList([ResnetLayer(layers[i+1], layers[i + 1], 1) for i in range(len(layers) - 1)])*

When you call f(x) i.e. super.forward() ,forward method passes input x to conv object where stride is 1 hence it does not half the input after convolution and dimensions remain unchanged.

Hope it helps.