# Wiki / Lesson Thread: Lesson 9

**melissa.fabros**(melissa.fabros) #1


**groverpr**(Prince Grover) #2

I have a few questions from the class –

`net = nn.Sequential( nn.Linear(28*28, 10), nn.LogSoftmax() )`

In the last non-linear layer, why did we use `logsoftmax` and not `softmax`? Weren't we exponentiating the outputs of the second-to-last layer to make them all positive? Why go back to `log` after computing [exp]/[sum of exp]?

`nn.Parameter(torch.randn(*dims)/dims[0])`

What is the reason for dividing by dims[0]? I tried it, and it doesn't work if we don't divide by dims[0]. By "it doesn't work", I mean fit() gives loss = nan and very bad accuracy.

Thanks

**jeremy**(Jeremy Howard (Admin)) #5

The loss functions in PyTorch generally assume you have applied LogSoftmax, for computational-efficiency reasons: https://discuss.pytorch.org/t/does-nllloss-handle-log-softmax-and-softmax-in-the-same-way/8835
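One way to see the pairing: `nn.CrossEntropyLoss` is simply `LogSoftmax` followed by `NLLLoss`, so a network ending in `nn.LogSoftmax` fed into `NLLLoss` produces the same loss as raw logits fed into `CrossEntropyLoss`. A minimal sketch with random data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw outputs of the final linear layer
targets = torch.tensor([1, 3, 5, 7])  # class indices for each example

# LogSoftmax + NLLLoss (the combination the lesson's net uses)
log_probs = nn.LogSoftmax(dim=1)(logits)
loss_a = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss applies log-softmax internally to the raw logits
loss_b = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(loss_a, loss_b))  # True: the two losses match
```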

This is *He initialization* (http://www.jefkine.com/deep/2016/08/08/initialization-of-deep-networks-case-of-rectifiers/). Although I may have forgotten a `sqrt` there…

Without careful initialization you’ll get gradient explosion. We discuss this in the DL course.
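For reference, He initialization with the `sqrt` included scales the weights by sqrt(2 / fan_in). A sketch (the `he_init` helper name is illustrative, not from the lesson):

```python
import torch

def he_init(*dims):
    # He initialization for ReLU networks: scale N(0, 1) draws by
    # sqrt(2 / fan_in). The lesson's version divides by dims[0] with
    # no sqrt, which still keeps the weights small enough that the
    # loss doesn't blow up to nan.
    return torch.randn(*dims) * (2 / dims[0]) ** 0.5

w = he_init(28 * 28, 10)
print(w.std())  # roughly sqrt(2/784), i.e. about 0.05
```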

**rajeshtamada**(Tamada Rajesh Kumar) #8

hi there,

I have a question about optimizer.zero_grad(). I have gone over the section where it is explained why we have to call this function a couple of times.

I still don't understand it.

From the PyTorch forum, I understand that, except for the special case where one wants to simulate bigger batches by accumulating gradients, one has to invoke optimizer.zero_grad() to clear the gradients before the next batch.

I would like to understand Jeremy's explanation, though.
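The accumulation behaviour is easy to see directly: PyTorch *adds* each `backward()` call's gradients into `.grad`, so without zeroing they pile up across batches. A minimal sketch with a single parameter:

```python
import torch

w = torch.ones(3, requires_grad=True)

# First "batch": loss = sum(2*w), so d(loss)/dw = 2 for each element
loss = (w * 2).sum()
loss.backward()
print(w.grad)   # tensor([2., 2., 2.])

# Second backward WITHOUT zeroing: gradients are added on top
loss = (w * 2).sum()
loss.backward()
print(w.grad)   # tensor([4., 4., 4.]) -- accumulated

# optimizer.zero_grad() does this for every parameter it manages
w.grad.zero_()
loss = (w * 2).sum()
loss.backward()
print(w.grad)   # tensor([2., 2., 2.]) again
```

This is why the training loop calls zero_grad() once per batch: each batch's gradients should be computed fresh, not stacked on the previous batch's.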