Lesson 5 In-Class Discussion


(Pramod) #105

From now on, I won’t take the click-bait on Backprop. There seem to be hundreds of articles whose primary aim is to tell me that backprop is just the chain rule.


(Charin) #106

I think an important factor in improving collaborative filtering is finding the biases: types of items, types of users, etc. Maybe we can do clustering based on their attributes for additional features.
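To make the bias idea concrete, here is a minimal numpy sketch (all names and numbers are my own illustration, not the lesson's code): a rating prediction that adds a global mean plus per-user and per-item bias terms on top of the usual dot product of latent factors.

```python
import numpy as np

# Hypothetical sketch: predicted rating = global mean + user bias
#                      + item bias + dot(user factors, item factors).
rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 3

mu = 3.5                                      # global mean rating
user_bias = rng.normal(0, 0.1, n_users)       # e.g. generous vs. harsh raters
item_bias = rng.normal(0, 0.1, n_items)       # e.g. widely liked vs. niche items
U = rng.normal(0, 0.1, (n_users, n_factors))  # user latent factors
V = rng.normal(0, 0.1, (n_items, n_factors))  # item latent factors

def predict(u, i):
    # biases capture "this user rates high" / "this item is rated high"
    # independently of the user-item interaction term
    return mu + user_bias[u] + item_bias[i] + U[u] @ V[i]

r = predict(2, 3)   # stays close to the global mean until trained
```

In training, the bias terms soak up the per-user and per-item offsets so the factor matrices only have to model the interaction.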


(ecdrid) #107

youtube.com might help in quenching thirst…

http://najeebkhan.github.io/blog/VecCal.html


(Kristin) #108

I like 3Blue1Brown - he explains calculus and linear algebra, but I’m not sure about Jacobian or Hessian matrices specifically: https://www.youtube.com/3blue1brown


(Kerem Turgutlu) #109

The problem is that I have probably searched many times before for a good explanation that would let me grasp every idea, so I need a specific source, not an obvious one. Thanks


(Ezequiel) #110

I think it’s more than the chain rule; from what I understand, the chain rule is involved, but applied in a specific order to be more efficient.


(Pavel Surmenok) #111

It’s a clever application of the chain rule with some optimizations, like dynamic programming: caching computed values so the same thing isn’t calculated many times. Chris Olah describes this trick well.
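A tiny sketch of that caching idea (my own toy example, not anyone's library code): the forward pass stores each intermediate value, and the backward pass reuses them, so every local derivative is computed exactly once.

```python
import math

# Toy reverse-mode pass for L = 3 * sin(x^2):
# forward remembers intermediates; backward reuses them.
def forward_backward(x):
    # forward pass: cache intermediates
    a = x * x          # a = x^2
    b = math.sin(a)    # b = sin(a)
    L = 3.0 * b        # L = 3b
    # backward pass: chain rule, reusing the cached a
    dL_db = 3.0
    dL_da = dL_db * math.cos(a)   # cos(a) reuses the cached a
    dL_dx = dL_da * 2.0 * x
    return L, dL_dx

L, g = forward_backward(0.5)
```

Analytically dL/dx = 3·cos(x²)·2x, which the cached backward pass reproduces without ever recomputing x² or sin(x²).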


(Charin) #112

I made a notebook recreating all these chain rule shenanigans in numpy, based largely on Andrej Karpathy’s cs231n and Chris Olah’s blog. For those who might be interested: https://github.com/cstorm125/sophia


(ecdrid) #113

(Pierre Dueck) #114

The chain rule for real-valued functions is well known: (f(g(x)))’ = f’(g(x))·g’(x). To generalize it to vector-valued functions, it’s more natural to think of differentiation as the best local linear approximation. Differentiating a composition of functions should then translate to composing linear maps - and that’s represented by a product of matrices.
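That matrix-product form can be checked numerically. A small sketch (the functions f and g here are arbitrary examples I picked): the Jacobian of h = f ∘ g equals J_f(g(x)) @ J_g(x), which we compare against central-difference estimates.

```python
import numpy as np

# h(x) = f(g(x)) with g: R^2 -> R^2 and f: R^2 -> R^2.
# Vector chain rule: J_h(x) = J_f(g(x)) @ J_g(x).
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def Jg(x):  # Jacobian of g, computed by hand
    return np.array([[x[1], x[0]],
                     [1.0, 1.0]])

def f(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def Jf(u):  # Jacobian of f, computed by hand
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def numeric_jacobian(h, x, eps=1e-6):
    # central differences, one column per input coordinate
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.7, -0.3])
analytic = Jf(g(x)) @ Jg(x)                  # matrix product of Jacobians
numeric = numeric_jacobian(lambda x: f(g(x)), x)
```

The two matrices agree to numerical precision, which is exactly the "composition of linear maps = product of matrices" statement.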


(Pramod) #115

Some great post-lecture reading by Ruder: http://ruder.io/optimizing-gradient-descent/


(Pavel Surmenok) #116

AdamW - is it this code? https://github.com/pytorch/pytorch/pull/3740/commits/ffddcc4ee2c35c00e54421bb0d1b145264288b24


(Ezequiel) #117

I was surprised when I learned that Jeremy’s approach of treating derivatives as dividing one quantity by another can in fact be formalized mathematically in non-standard calculus. So @jeremy, thinking that way is correct in some sense =)


(john v) #118

Here is the GitHub issue, which has a link to the arXiv paper:


(Pavel Surmenok) #119

Numerical computation of derivatives can be useful for verifying that an analytical computation works as expected, in case you implement the derivative calculation yourself instead of using the autodifferentiation capabilities that PyTorch or TensorFlow provide. It’s an easy way to test differentiation code.
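For anyone who hasn't seen it, the standard gradient-check recipe looks roughly like this (a generic sketch on a toy loss of my choosing, not any framework's built-in checker): perturb each weight by ±eps and compare the central-difference slope to the hand-derived gradient.

```python
import numpy as np

def loss(w):
    # toy loss: L(w) = sum(w^2)
    return np.sum(w ** 2)

def analytic_grad(w):
    # hand-derived gradient: dL/dw_i = 2 * w_i
    return 2 * w

def numeric_grad(f, w, eps=1e-5):
    # central differences: (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

w = np.array([0.5, -1.2, 3.0])
g_analytic = analytic_grad(w)
g_numeric = numeric_grad(loss, w)
```

If the two disagree beyond a small tolerance, the analytical derivative (or the forward pass) has a bug.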


(Kalpana Vora) #120

Will using options such as momentum and adam result in better accuracy? Or, is this more to speed up the training process?


(Pete Condon) #121

It’s mostly for speeding up the training - but I suppose it depends on how complex your model is and how long you’re willing to keep training it.
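The speed-up is easy to see on a toy problem. A hedged sketch (the quadratic loss, learning rate, and momentum value below are my own arbitrary choices): on an ill-conditioned loss surface, SGD with momentum reaches a much lower loss than plain SGD in the same number of steps.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with very different
# curvature along the two axes (ill-conditioned).
A = np.diag([1.0, 50.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def run(momentum, lr=0.015, steps=100):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - lr * grad(w)   # velocity accumulates past gradients
        w = w + v
    return loss(w)

plain = run(momentum=0.0)          # plain gradient descent
with_momentum = run(momentum=0.9)  # same lr and step budget
```

Momentum accelerates progress along the flat direction while damping the oscillation along the steep one, so it converges faster for the same step budget - which matches the "mostly speed" intuition.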


(Charin) #122

I guess the faster you train, the more likely you are to find a better minimum too?


(Pete Condon) #123

This is the AdamW paper: https://arxiv.org/pdf/1711.05101.pdf


(Yihui Ray Ren) #124

Hi @jeremy,

I have a question on the F.dropout() in lesson5-movielens class EmbeddingNet.

Since dropout() behaves differently in training and testing, is it better to pass self.training into F.dropout()?
So, instead of x = F.dropout(F.relu(self.lin1(x)), 0.75), maybe x = F.dropout(F.relu(self.lin1(x)), 0.75, self.training)?

Another question: when nn.Dropout is used instead of nn.functional.dropout(), is it better to override the eval() function to include the eval() of all nested modules (like Dropout)?
The implementation of eval() is not recursive, and nn.Dropout.forward() uses self.training instead of taking a boolean.

Is there any advantage to using F.dropout (in forward()) rather than defining a layer self.dropout = nn.Dropout(X) in initialization?
It seems F.dropout is faster (which might not be true), but nn.Dropout makes the neural net architecture easier to read.

Thanks.
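For reference on why the training flag matters here, the train/eval difference of (inverted) dropout can be sketched in plain numpy - my own illustration, not the lesson or PyTorch source:

```python
import numpy as np

def dropout(x, p, training, rng):
    # Inverted dropout: at train time, drop each unit with probability p
    # and scale survivors by 1/(1-p), so eval time is a pure pass-through.
    if not training:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

rng = np.random.default_rng(0)
x = np.ones(10000)
train_out = dropout(x, 0.75, True, rng)   # ~75% zeros, survivors scaled by 4
eval_out = dropout(x, 0.75, False, rng)   # identical to x
```

If the training flag is never switched off (the concern in the question above), the network keeps zeroing activations at test time, which corrupts predictions - hence the appeal of nn.Dropout, whose behavior follows the module's train/eval mode automatically.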