ELU flattens out at -1 instead of zero for x < 0, in part to address the issue with the mean (its activations can average closer to zero than ReLU's).
We need a non-linearity because two linear functions in a row is just another linear function, and linear functions don't recognize cats from dogs. ReLU is a fast non-linearity.
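A quick numpy sketch of the "two linears collapse into one" point (shapes and names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
w1 = rng.normal(size=(4, 3))
w2 = rng.normal(size=(3, 2))

# Two linear layers with no non-linearity in between collapse
# into a single linear layer with matrix w1 @ w2:
two_layers = (x @ w1) @ w2
one_layer = x @ (w1 @ w2)
assert np.allclose(two_layers, one_layer)

# Putting a ReLU between them breaks the collapse:
relu = lambda t: np.maximum(t, 0.0)
with_relu = relu(x @ w1) @ w2
assert not np.allclose(with_relu, one_layer)
```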
I understand why the weights might vanish (the standard deviation decreases layer by layer due to the non-linearity), but why might they explode?
Loving the class… learned about two topics that I was ignoring in code:
- what is `[..., ]`
- what is `c[:,None]`
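For anyone else wondering, here's a small numpy illustration of both (the variable names are just examples):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)
# `...` (Ellipsis) stands in for "all the remaining axes", which is
# handy when you don't know how many dimensions an array has:
assert (m[..., 0] == m[:, 0]).all()

c = np.arange(3)          # shape (3,)
# Indexing with None (np.newaxis) inserts a new length-1 axis,
# turning c into a column vector so it broadcasts against rows:
col = c[:, None]          # shape (3, 1)
assert col.shape == (3, 1)
```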
If the scale of your activations is 1.5 after the first layer and activation, it'll be 1.5**2 after the second, and quickly gets out of hand.
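You can see this compounding directly; a rough sketch (layer width and scale factor picked arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(64, n))
# Weights scaled so each matmul grows the activation std by roughly 1.5x:
w = rng.normal(size=(n, n)) * 1.5 / np.sqrt(n)

stds = []
a = x
for _ in range(20):
    a = a @ w
    stds.append(a.std())

# After 20 layers the std has grown by orders of magnitude,
# roughly like 1.5**n_layers:
assert stds[-1] > stds[0] * 100
```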
@PierreO, technically for Einstein summation convention, such situations are simply poorly defined, and are likely errors.
In any case, no summation is implied if the index appears in the same dimension.
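To make the convention concrete with `np.einsum`: an index shared between operands is summed over only when it's dropped from the output spec; keeping it in the output suppresses the summation. A small sketch:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# 'k' appears in both inputs but not the output, so it is summed over:
mm = np.einsum('ik,kj->ij', a, b)
assert np.allclose(mm, a @ b)

# Keeping 'k' in the output keeps the element-wise products unsummed:
terms = np.einsum('ik,kj->ikj', a, b)
assert np.allclose(terms.sum(axis=1), a @ b)
```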
Why does PyTorch transpose and we don't? Is that just a stylistic choice for how we want to treat rows vs columns?
It might be linked to some kind of optimization, or maybe it just made more sense to them when they wrote that layer.
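For what it's worth, the two conventions compute the same thing; PyTorch's `nn.Linear` stores its weight as (out_features, in_features) and transposes in the forward pass. A numpy sketch of the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # batch of 5, 4 input features

# "Our" convention: weight shaped (in, out), forward is x @ w
w_in_out = rng.normal(size=(4, 3))
ours = x @ w_in_out

# PyTorch-style: store the weight as (out, in) and transpose on use
w_out_in = w_in_out.T                # shape (3, 4)
theirs = x @ w_out_in.T
assert np.allclose(ours, theirs)
```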
Why couldn't you just set a cutoff for the initialization instead of implementing all this math? I.e., all w1 must be between 0.001 and 0.01 (and then randomly draw from that range with a generic Gaussian distribution).
Because if you use too tiny a std, your weights will make the activations vanish across the network and all become zeros. It's a tricky business: you have to be neither too big nor too small.
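Here's a sketch comparing a tiny fixed std against a Kaiming-style scale of sqrt(2/n) (the helper name and sizes are made up for the example):

```python
import numpy as np

def depth_activations(w_std, n=100, layers=30, seed=0):
    """Std of the activations after `layers` linear+ReLU layers."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(64, n))
    for _ in range(layers):
        w = rng.normal(size=(n, n)) * w_std   # fresh weights per layer
        a = np.maximum(a @ w, 0)              # linear + ReLU
    return a.std()

tiny = depth_activations(0.01)              # activations collapse to ~0
good = depth_activations(np.sqrt(2 / 100))  # Kaiming-style scale survives
assert tiny < 1e-10
assert 1e-2 < good < 1e2
```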
Does Terence know LaTeX, or is all that LaTeX copy-pasted from Wikipedia?
I think you just gave me the solution to a problem I've been having with one of my networks…!
I am using a single embedding layer, but for 16 inputs. I'm guessing the scales aren't working out well because of it.
@rachel to help people get started…
I can confirm that the Docker image works well. https://forums.fast.ai/t/docker-image-of-fastai-jupyter-cuda10-py3-7/40081
To update, log in via bash, then git pull, reinstall, then git clone the part 3 code. I have provided the commands there.
He made Bookish, a tool to convert Markdown to LaTeX.
Would be cleaner if Jeremy used y=f(x) and y=g(x), just to avoid any confusion about composing functions.
He didn't use f and g afterward though.
I love the course! I really hope that you will expand this course into some advanced topics like randomized linear algebra and the link to convex/non-convex optimization.
Sorry, why do his function defs not have `return ____` at the end?
Is it a Python default that functions return the result of the last operation?
It's storing the gradients, that's all we need.
It's assigned to a layer via a mutation (class.g = whatever)
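A minimal sketch of that pattern: `backward` returns nothing and instead writes gradients onto `.g` attributes by mutation. The class and names here are illustrative, not the course's exact code; plain numpy arrays don't accept new attributes, so a trivial `ndarray` subclass stands in for the torch tensors the course uses (which allow `t.g = ...` directly):

```python
import numpy as np

class Arr(np.ndarray): pass                # ndarray subclass that allows .g
def arr(x): return np.asarray(x, dtype=float).view(Arr)

class Lin:
    def __init__(self, w): self.w = w
    def __call__(self, inp):
        self.inp, self.out = inp, arr(inp @ self.w)
        return self.out
    def backward(self):
        # No return value: gradients are stored by mutating .g attributes
        self.inp.g = arr(self.out.g @ self.w.T)
        self.w.g = arr(self.inp.T @ self.out.g)

rng = np.random.default_rng(0)
x, w = arr(rng.normal(size=(5, 4))), arr(rng.normal(size=(4, 3)))
lin = Lin(w)
out = lin(x)
out.g = arr(np.ones_like(out))   # pretend upstream gradient
lin.backward()                   # returns None; x.g and w.g now exist
assert x.g.shape == (5, 4) and w.g.shape == (4, 3)
```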