From now on, I won’t take the click-bait on Backprop. There seem to be hundreds of articles with the primary aim of telling me that Backprop is just chain rule.
I think an important factor in improving collaborative filtering is accounting for the biases: types of items, types of users, etc. Maybe we can do clustering based on their attributes to get additional features.
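As a rough sketch of what explicit bias terms look like (toy data and hand-picked hyperparameters, purely for illustration), here is matrix factorization with a global mean plus user and item biases in numpy:

```python
import numpy as np

# Toy sketch: matrix factorization with explicit user/item bias terms,
# trained by SGD on a few made-up (user, item, rating) triples.
rng = np.random.default_rng(0)
n_users, n_items, n_factors = 5, 4, 2
ratings = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0), (3, 3, 2.0)]

mu = np.mean([r for _, _, r in ratings])      # global mean rating
bu = np.zeros(n_users)                        # per-user bias
bi = np.zeros(n_items)                        # per-item bias
P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        pred = mu + bu[u] + bi[i] + P[u] @ Q[i]
        err = r - pred
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
```

The bias terms absorb "this user rates everything high" and "this item is generally liked" effects, leaving the factors to model the interaction.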
I like 3Blue1Brown - he explains calculus and linear algebra well, but I’m not sure whether he covers Jacobian or Hessian matrices specifically: https://www.youtube.com/3blue1brown
Problem is, I’ve probably searched for a good explanation that would let me grasp every idea many, many times before, so I need a specific source, not an obvious one. Thanks
I think it’s more than the chain rule; from what I understand, the chain rule is involved, but applied in a specific order to be more efficient.
It’s a clever application of the chain rule with some optimizations, like dynamic programming: caching intermediate values so the same thing isn’t computed many times over. Chris Olah describes this trick well.
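As a toy illustration of that caching trick (the function here is made up just for this post): the forward pass stores intermediate values, and the backward pass reuses them instead of recomputing.

```python
# Toy example: y = (a*b) + (a*b)*c. The product a*b is computed once in
# the forward pass, cached, and reused by the backward pass.
def forward(a, b, c):
    cache = {}
    cache["ab"] = a * b              # computed once, reused twice below
    cache["abc"] = cache["ab"] * c
    y = cache["ab"] + cache["abc"]
    return y, cache

def backward(a, b, c, cache):
    d_ab = 1 + c                     # dy/d(ab): ab appears directly and inside ab*c
    da = d_ab * b                    # chain rule through ab = a*b
    db = d_ab * a
    dc = cache["ab"]                 # dy/dc = ab, read straight from the cache
    return da, db, dc

y, cache = forward(2.0, 3.0, 4.0)
print(y)                                  # 6 + 24 = 30.0
print(backward(2.0, 3.0, 4.0, cache))     # (15.0, 10.0, 6.0)
```

Backprop does the same thing at scale: every node's forward value is kept around so each gradient is computed exactly once.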
I made a notebook recreating all these chain rule shenanigans in numpy, based largely on Andrej Karpathy’s cs231n and Chris Olah’s blog. For those who might be interested: https://github.com/cstorm125/sophia
The chain rule for real-valued functions of one variable is well known: (f(g(x)))’ = f’(g(x))·g’(x). To generalize it to vector-valued functions, it’s more natural to think of the derivative as the best local linear approximation. Differentiating a composition of functions then corresponds to composing their linear approximations, and a composition of linear maps is represented by a product of matrices.
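This can be checked numerically. A small sketch (the functions f and g are chosen arbitrarily for illustration): estimate the Jacobian of f∘g directly, then compare it with the product of the individual Jacobians.

```python
import numpy as np

def g(x):                       # R^2 -> R^2
    return np.array([x[0] * x[1], np.sin(x[0])])

def f(y):                       # R^2 -> R^1
    return np.array([y[0] ** 2 + y[1]])

def jacobian(fn, x, eps=1e-6):
    """Numerical Jacobian of fn at x via central differences."""
    fx = fn(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        d = np.zeros_like(x)
        d[j] = eps
        J[:, j] = (fn(x + d) - fn(x - d)) / (2 * eps)
    return J

x = np.array([0.7, -1.3])
J_composed = jacobian(lambda v: f(g(v)), x)     # Jacobian of f∘g directly
J_chain = jacobian(f, g(x)) @ jacobian(g, x)    # product of Jacobians
print(np.allclose(J_composed, J_chain, atol=1e-5))  # → True
```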
Some great post-lecture reading by Ruder: http://ruder.io/optimizing-gradient-descent/
AdamW - is it this code? https://github.com/pytorch/pytorch/pull/3740/commits/ffddcc4ee2c35c00e54421bb0d1b145264288b24
I was surprised when I learned that Jeremy’s approach of treating derivatives as dividing things can in fact be formalized mathematically in non-standard calculus. So @jeremy thinking that way is correct in some sense =)
Here is the GitHub issue, which has a link to the arXiv paper:
Numerical computation of derivatives is useful for verifying that an analytical computation works as expected, in case you implement the derivative calculation yourself instead of using the autodifferentiation that PyTorch or TensorFlow provide. It’s an easy way to test differentiation code.
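A minimal sketch of such a gradient check, assuming a hand-written mean-squared-error gradient (the loss and data here are just for illustration): compare the analytical gradient against central differences.

```python
import numpy as np

def loss(w, x, y):
    pred = x @ w
    return 0.5 * np.mean((pred - y) ** 2)

def analytic_grad(w, x, y):
    """Hand-derived gradient of the MSE loss above."""
    pred = x @ w
    return x.T @ (pred - y) / len(y)

def numeric_grad(w, x, y, eps=1e-6):
    """Central-difference estimate, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d, x, y) - loss(w - d, x, y)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)
print(np.allclose(analytic_grad(w, x, y), numeric_grad(w, x, y), atol=1e-6))  # → True
```

If the two disagree beyond the tolerance, the analytical derivation (or its implementation) has a bug.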
Will using options such as momentum and Adam result in better accuracy? Or is this more to speed up the training process?
It’s mostly for speeding up the training - but I suppose it depends on how complex your model is and how long you’re willing to keep training it.
I guess the faster you train, the more likely you are to find a better minimum, too?
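To make the speed-up concrete, here is a small sketch (made-up ill-conditioned quadratic, hand-picked hyperparameters) comparing plain gradient descent with momentum. On problems like this, momentum reaches a much lower loss in the same number of steps:

```python
import numpy as np

# Ill-conditioned quadratic: f(w) = 0.5 * (w[0]^2 + 25 * w[1]^2).
def grad(w):
    return np.array([w[0], 25.0 * w[1]])

def run(momentum, steps=100, lr=0.02):
    w = np.array([10.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = momentum * v - lr * grad(w)   # momentum=0.0 is plain GD
        w = w + v
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)  # final loss

print(run(momentum=0.0))   # plain gradient descent
print(run(momentum=0.9))   # momentum: much lower final loss here
```

The slow, flat direction is where momentum helps: accumulated velocity keeps the iterate moving where raw gradients are tiny.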
I have a question on F.dropout(). Since dropout() behaves differently in “training” and “testing”, is it better to pass self.training into the call? So, instead of
x = F.dropout(F.relu(self.lin1(x)), 0.75), maybe
x = F.dropout(F.relu(self.lin1(x)), 0.75, self.training)?
Another question: when nn.Dropout is used instead of nn.functional.dropout(), is it better to override the eval() function so it reaches all nested “models” (like Dropout)? As far as I can tell, the implementation of eval() is in fact recursive (it calls train(False), which propagates to all children), and nn.Dropout reads self.training itself instead of being passed a boolean, so there should be no need to override it.
Is there any advantage of using F.dropout (in forward()) rather than defining a layer self.dropout = nn.Dropout(X) in initialization? It seems using F.dropout is faster (might not be true), but using nn.Dropout makes the neural net architecture easier to read.
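To see why the training flag matters at all, here is a minimal numpy sketch of inverted dropout (not PyTorch’s actual implementation, just the idea). nn.Dropout reads the module’s self.training flag for you; F.dropout needs the flag passed explicitly, which is what the question above is about.

```python
import numpy as np

def dropout(x, p=0.75, training=True, rng=None):
    """Inverted dropout: zero units with probability p, rescale the rest."""
    if not training:
        return x                        # eval mode: identity, no rescaling needed
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p     # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)         # rescale so the expected value is unchanged

x = np.ones(10000)
train_out = dropout(x, p=0.75, training=True)
eval_out = dropout(x, p=0.75, training=False)
print(round(float(train_out.mean()), 1))  # ≈ 1.0 (expectation preserved)
print(float(eval_out.mean()))             # 1.0 exactly (identity in eval mode)
```

Forgetting the flag means dropout keeps firing at test time, which silently degrades predictions; that’s the failure mode passing self.training (or using nn.Dropout) avoids.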