Lesson 5 In-Class Discussion


(Pramod) #105

From now on, I won’t take the click-bait on Backprop. There seem to be hundreds of articles whose primary aim is to tell me that backprop is just the chain rule.


(Charin) #106

I think an important factor in improving collaborative filtering is finding the biases: types of items, types of users, etc. Maybe we can do clustering based on their attributes for additional features.
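To make the bias idea concrete, here is a minimal numpy sketch (all names and numbers are my own illustration, not the lesson's code): a rating prediction that adds a global mean plus per-user and per-item bias terms on top of the usual dot product of latent factors.

```python
import numpy as np

# Hypothetical sketch: predicted rating = global mean + user bias
#                      + item bias + dot(user factors, item factors).
rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 3

mu = 3.5                                      # global mean rating
user_bias = rng.normal(0, 0.1, n_users)       # e.g. generous vs. harsh raters
item_bias = rng.normal(0, 0.1, n_items)       # e.g. widely liked vs. niche items
U = rng.normal(0, 0.1, (n_users, n_factors))  # user latent factors
V = rng.normal(0, 0.1, (n_items, n_factors))  # item latent factors

def predict(u, i):
    # biases capture "this user rates high" / "this item is rated high"
    # independently of the user-item interaction term
    return mu + user_bias[u] + item_bias[i] + U[u] @ V[i]

r = predict(2, 3)   # stays close to the global mean until trained
```

In training, the bias terms soak up the per-user and per-item offsets so the factor matrices only have to model the interaction.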


(ecdrid) #107

youtube.com might help in quenching thirst…

http://najeebkhan.github.io/blog/VecCal.html


(Kristin) #108

I like 3Blue1Brown - he explains calculus and linear algebra, but I’m not sure about Jacobian or Hessian matrices specifically: https://www.youtube.com/3blue1brown


(Kerem Turgutlu) #109

The problem is that I have probably searched many times before for a good explanation that would let me grasp every idea, so I need a specific source, not an obvious one. Thanks


(Ezequiel) #110

I think it’s more than the chain rule; from what I understand, the chain rule is involved, but applied in a specific order to be more efficient.


(Pavel Surmenok) #111

It’s a clever application of the chain rule with some optimizations, like dynamic programming: caching computed values so the same thing isn’t calculated many times. Chris Olah describes this trick well.
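A tiny sketch of that caching idea (my own toy example, not anyone's library code): the forward pass stores each intermediate value, and the backward pass reuses them, so every local derivative is computed exactly once.

```python
import math

# Toy reverse-mode pass for L = 3 * sin(x^2):
# forward remembers intermediates; backward reuses them.
def forward_backward(x):
    # forward pass: cache intermediates
    a = x * x          # a = x^2
    b = math.sin(a)    # b = sin(a)
    L = 3.0 * b        # L = 3b
    # backward pass: chain rule, reusing the cached a
    dL_db = 3.0
    dL_da = dL_db * math.cos(a)   # cos(a) reuses the cached a
    dL_dx = dL_da * 2.0 * x
    return L, dL_dx

L, g = forward_backward(0.5)
```

Analytically dL/dx = 3·cos(x²)·2x, which the cached backward pass reproduces without ever recomputing x² or sin(x²).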


(Charin) #112

I made a notebook recreating all these chain rule shenanigans in numpy, based largely on Andrej Karpathy’s cs231n and Chris Olah’s blog. For those who might be interested: https://github.com/cstorm125/sophia


(ecdrid) #113

(Pierre Dueck) #114

The chain rule for real-valued functions is well known: (f(g(x)))’ = f’(g(x))·g’(x). To generalize it to vector-valued functions, it’s more natural to think of differentiation as the best local linear approximation. Differentiating a composition of functions should then translate to composing linear maps - and that’s represented by a product of matrices.
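That matrix-product form can be checked numerically. A small sketch (the functions f and g here are arbitrary examples I picked): the Jacobian of h = f ∘ g equals J_f(g(x)) @ J_g(x), which we compare against central-difference estimates.

```python
import numpy as np

# h(x) = f(g(x)) with g: R^2 -> R^2 and f: R^2 -> R^2.
# Vector chain rule: J_h(x) = J_f(g(x)) @ J_g(x).
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def Jg(x):  # Jacobian of g, computed by hand
    return np.array([[x[1], x[0]],
                     [1.0, 1.0]])

def f(u):
    return np.array([np.sin(u[0]), u[0] * u[1]])

def Jf(u):  # Jacobian of f, computed by hand
    return np.array([[np.cos(u[0]), 0.0],
                     [u[1], u[0]]])

def numeric_jacobian(h, x, eps=1e-6):
    # central differences, one column per input coordinate
    cols = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((h(x + e) - h(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([0.7, -0.3])
analytic = Jf(g(x)) @ Jg(x)                  # matrix product of Jacobians
numeric = numeric_jacobian(lambda x: f(g(x)), x)
```

The two matrices agree to numerical precision, which is exactly the "composition of linear maps = product of matrices" statement.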


(Pramod) #115

Some great post-lecture reading by Ruder: http://ruder.io/optimizing-gradient-descent/


(Pavel Surmenok) #116

AdamW - is it this code? https://github.com/pytorch/pytorch/pull/3740/commits/ffddcc4ee2c35c00e54421bb0d1b145264288b24


(Ezequiel) #117

I was surprised when I learned that Jeremy’s approach of treating derivatives as dividing one quantity by another can in fact be formalized mathematically in non-standard calculus. So @jeremy, thinking that way is correct in some sense =)


(john v) #118

Here is the GitHub issue, which has a link to the arXiv paper:


(Pavel Surmenok) #119

Numerical computation of derivatives can be useful for verifying that an analytical computation works as expected, in case you implement the derivative calculation yourself instead of using the autodifferentiation capabilities that PyTorch or TensorFlow provide. It’s an easy way to test differentiation code.
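For anyone who hasn't seen it, the standard gradient-check recipe looks roughly like this (a generic sketch on a toy loss of my choosing, not any framework's built-in checker): perturb each weight by ±eps and compare the central-difference slope to the hand-derived gradient.

```python
import numpy as np

def loss(w):
    # toy loss: L(w) = sum(w^2)
    return np.sum(w ** 2)

def analytic_grad(w):
    # hand-derived gradient: dL/dw_i = 2 * w_i
    return 2 * w

def numeric_grad(f, w, eps=1e-5):
    # central differences: (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return g

w = np.array([0.5, -1.2, 3.0])
g_analytic = analytic_grad(w)
g_numeric = numeric_grad(loss, w)
```

If the two disagree beyond a small tolerance, the analytical derivative (or the forward pass) has a bug.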


(Kalpana Vora) #120

Will using options such as momentum and adam result in better accuracy? Or, is this more to speed up the training process?


(Pete Condon) #121

It’s mostly for speeding up the training - but I suppose it depends on how complex your model is and how long you’re willing to keep training it.
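The speed-up is easy to see on a toy problem. A hedged sketch (the quadratic loss, learning rate, and momentum value below are my own arbitrary choices): on an ill-conditioned loss surface, SGD with momentum reaches a much lower loss than plain SGD in the same number of steps.

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T A w with very different
# curvature along the two axes (ill-conditioned).
A = np.diag([1.0, 50.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def run(momentum, lr=0.015, steps=100):
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - lr * grad(w)   # velocity accumulates past gradients
        w = w + v
    return loss(w)

plain = run(momentum=0.0)          # plain gradient descent
with_momentum = run(momentum=0.9)  # same lr and step budget
```

Momentum accelerates progress along the flat direction while damping the oscillation along the steep one, so it converges faster for the same step budget - which matches the "mostly speed" intuition.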


(Charin) #122

I guess the faster you train, the more likely you are to find a better minimum too?


(Pete Condon) #123

This is the AdamW paper: https://arxiv.org/pdf/1711.05101.pdf


(Yihui Ray Ren) #124

Hi @jeremy,

I have a question on the F.dropout() in lesson5-movielens class EmbeddingNet.

Since dropout() behaves differently in training and testing, is it better to pass self.training into F.dropout()?
So, instead of x = F.dropout(F.relu(self.lin1(x)), 0.75), maybe x = F.dropout(F.relu(self.lin1(x)), 0.75, self.training)?

Another question: when nn.Dropout is used instead of nn.functional.dropout(), is it better to override the eval() function to include the eval() of all nested modules (like Dropout)?
The implementation of eval() is not recursive, and nn.Dropout.forward() uses self.training instead of taking a boolean.

Is there any advantage to using F.dropout (in forward()) rather than defining a layer self.dropout = nn.Dropout(X) in initialization?
It seems F.dropout is faster (which might not be true), but nn.Dropout makes the neural net architecture easier to read.

Thanks.
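For reference on why the training flag matters here, the train/eval difference of (inverted) dropout can be sketched in plain numpy - my own illustration, not the lesson or PyTorch source:

```python
import numpy as np

def dropout(x, p, training, rng):
    # Inverted dropout: at train time, drop each unit with probability p
    # and scale survivors by 1/(1-p), so eval time is a pure pass-through.
    if not training:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

rng = np.random.default_rng(0)
x = np.ones(10000)
train_out = dropout(x, 0.75, True, rng)   # ~75% zeros, survivors scaled by 4
eval_out = dropout(x, 0.75, False, rng)   # identical to x
```

If the training flag is never switched off (the concern in the question above), the network keeps zeroing activations at test time, which corrupts predictions - hence the appeal of nn.Dropout, whose behavior follows the module's train/eval mode automatically.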