ELU flattens out at -1 instead of zero for x < 0, in part to address the issue with the mean (the negative outputs pull the mean of the activations back toward zero).
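For reference, the standard ELU formula is identity for positive inputs and alpha*(exp(x) - 1) otherwise, which saturates at -alpha (so -1 with the default alpha=1). A minimal NumPy sketch:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha*(exp(x) - 1) for x <= 0,
    # which flattens out at -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-100.0, -1.0, 0.0, 2.0])))
# the first entry is essentially -1: the curve has saturated
```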
We need a non-linearity because two linear functions in a row is just another linear function. And linear functions don't recognize cats from dogs. ReLU is a fast non-linearity.
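You can see the collapse directly: two matrix multiplies compose into a single matrix multiply, while inserting a ReLU between them breaks that collapse. A small NumPy demo:

```python
import numpy as np

# Two linear maps f(x) = A x and g(x) = B x compose into one: g(f(x)) = (B A) x.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.5, -1.0], [2.0, 0.0]])
x = np.array([1.0, -2.0])

composed = B @ (A @ x)   # two "layers"
single = (B @ A) @ x     # one equivalent layer
print(np.allclose(composed, single))  # True: the stack collapsed

# A ReLU in between makes the composition genuinely non-linear:
relu = lambda v: np.maximum(v, 0.0)
with_relu = B @ relu(A @ x)
print(np.allclose(with_relu, single))  # False
```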
I understand why the weights might vanish (the standard deviation decreases layer by layer due to the non-linearity), but why might they explode?
loving class… learned about two topics that I was ignoring in code:
- what is [..., ]
- what is c[:,None]
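Both of these are standard NumPy indexing tricks (PyTorch tensors accept the same syntax). A quick sketch of what each one does:

```python
import numpy as np

c = np.arange(6).reshape(2, 3)

# `...` (Ellipsis) means "all remaining axes": for a 2-D array,
# c[..., 0] is the same as c[:, 0] (the first column).
print(c[..., 0])

# `None` (np.newaxis) inserts a new axis of length 1:
# c[:, None] turns shape (2, 3) into (2, 1, 3), handy for broadcasting.
print(c[:, None].shape)
```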
If the scale of your weights is 1.5 after the first layer and activation, it'll be 1.5**2 at the second, and quickly get out of hand.
@PierreO, technically for Einstein summation convention, such situations are simply poorly defined, and likely are errors.
In any case, no summation is implied if the index appears in the same dimension.
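For anyone following along, the usual convention is that an index repeated across operands and absent from the output is summed over; a repeated index within a single operand picks out the diagonal. A NumPy illustration (the exact case discussed above may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((2, 3))
b = rng.random((3, 4))

# In 'ik,kj->ij', the repeated index k does not appear in the output,
# so it is summed over: ordinary matrix multiplication.
out = np.einsum('ik,kj->ij', a, b)
print(np.allclose(out, a @ b))  # True

# A repeated index within one operand means the diagonal:
m = np.arange(9).reshape(3, 3)
print(np.einsum('ii->', m))  # trace of m
```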
Why does PyTorch transpose and we donāt? Is that just a stylistic choice for how we want to treat rows vs columns?
It might be linked to some kind of optimization, or maybe it just made more sense to them when they wrote that layer.
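For context: PyTorch's nn.Linear stores its weight with shape (out_features, in_features) and computes x @ W.T + b, whereas the course code stores the weight as (in_features, out_features) and computes x @ W directly. A NumPy sketch showing the two layouts are mathematically identical:

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f = 3, 2

W_pt = rng.standard_normal((out_f, in_f))  # PyTorch-style layout: (out, in)
W_course = W_pt.T                          # course-style layout: (in, out)
x = rng.standard_normal((5, in_f))         # a batch of 5 inputs

y_pytorch_style = x @ W_pt.T   # what nn.Linear effectively computes (no bias here)
y_course_style = x @ W_course  # same result, no transpose needed
print(np.allclose(y_pytorch_style, y_course_style))  # True
```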
Why couldn't you just set a cutoff for the initialization instead of implementing all this math? I.e., all w1 must be between 0.001 and 0.01 (and then randomly draw from that range with a generic Gaussian distribution).
Because if you use too tiny a std, then your weights will vanish across the network and all become zeros. It's a tricky business: you have to be neither too big nor too small.
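Here is a small NumPy experiment (layer count and width are arbitrary choices) showing both failure modes: a fixed tiny std makes activations vanish, a large one makes them explode, and something like 1/sqrt(n) keeps them roughly stable across depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def depth_scale(std, layers=50, n=512):
    # Push a random vector through `layers` linear layers whose weights
    # are drawn from N(0, std^2); return the final mean activation size.
    a = rng.standard_normal(n)
    for _ in range(layers):
        a = (rng.standard_normal((n, n)) * std) @ a
    return np.abs(a).mean()

print(depth_scale(0.001))             # far too small: activations vanish toward 0
print(depth_scale(0.2))               # too big: activations explode
print(depth_scale(1 / np.sqrt(512)))  # ~1/sqrt(n): stays roughly order 1
```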
Does Terence know LaTeX, or is all that LaTeX copy-pasted from Wikipedia?
I think you just gave me the solution to a problem I've been having with one of my networks…!
I am using a single embedding layer, but for 16 inputs. I'm guessing the scales aren't working out well because of it.
@rachel to help people get started…
I can confirm that the Docker image works well. Docker image of fastai + Jupyter: CUDA10, py3.7.
To update, log in via bash, then git pull, reinstall, and git clone the part 3 code. I have provided the commands there.
He made Bookish, a tool to convert Markdown to LaTeX.
Would be cleaner if Jeremy used y=f(x) and y=g(x), just to avoid any confusion about composing functions.
He didn't use f and f afterward, though.
I love the course! I really hope that you will expand this course into some advanced topics like randomized linear algebra and its linkage to convex/non-convex optimization.
Sorry, why do his function defs not have "return ____" at the end?
Is it a Python default that functions return the result of the last operation?
It's storing the gradients, that's all we need.
it's assigned to a layer via a mutation (class.g = whatever)
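To be explicit: Python functions without a return statement return None (not the last expression), and these backward functions work purely by side effect. A toy sketch in that spirit (the Tensor class and relu_backward here are illustrative, not the course's exact code):

```python
class Tensor:
    def __init__(self, data):
        self.data = data
        self.g = None  # gradient slot, filled in later by a backward pass

def relu_backward(inp, out_grad):
    # No return statement: the result is written onto inp.g (a mutation),
    # so the function itself returns None.
    inp.g = [g if d > 0 else 0.0 for d, g in zip(inp.data, out_grad)]

t = Tensor([-1.0, 2.0])
result = relu_backward(t, [0.5, 0.5])
print(result)  # None -- nothing was returned
print(t.g)     # the gradient was stored on the tensor instead
```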