is there an advantage of paying close attention to initialisation vs using something like batch norm?
Jeremy said that if you forgot something to go back to the part1 notebooks, but where are they? Is this https://github.com/fastai/fastai_docs/tree/master/dev_nb correct?
Oh yes, it will change the way your model trains by a lot.
The notebooks from part 1 are here: https://github.com/fastai/course-v3/tree/master/nbs/dl1
in what way? faster convergence over batch norm?
Why do we want to keep variance = 1 all the time? Is it because of the gradient explosion/vanishing problem?
It's only a definition
Thanks!
What is the link I found then…
how does your computer explode when taking std deviation and mean again and again?
Jeremy is explaining it now.
I believe it could even be the difference between converging and not converging at all (if you have networks with many layers, as discussed in the papers shared during the previous lesson).
Yup, that is so
No, it's multiplying a vector by the same matrix again and again.
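For anyone who missed the demo, this is roughly the experiment (a minimal sketch, not the exact notebook code): repeatedly multiplying a vector by the same unscaled random matrix makes the values overflow, while scaling the matrix by 1/sqrt(fan_in) keeps the std near 1.

```python
import torch

x = torch.randn(512)                    # stand-in for the activations
a = torch.randn(512, 512)               # one layer's weight matrix, naive init

for i in range(100):
    x = a @ x                           # same matrix over and over, like a deep stack of layers
print(x.mean(), x.std())                # overflows to nan after a few dozen iterations

x = torch.randn(512)
a = torch.randn(512, 512) / 512**0.5    # scale by 1/sqrt(fan_in), roughly what Xavier/Kaiming init does
for i in range(100):
    x = a @ x
print(x.mean(), x.std())                # std stays in a sane range instead of blowing up
```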
This feels important to the discussion, but I think I missed what Jeremy was saying here.
Is there any reason not to initialize each layer and directly scale the weights/biases to achieve 0 mean/1 stdev based on the observed statistics, rather than pre-solving for particular activations/structures?
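As a sketch of what that question proposes (a data-dependent init in the spirit of LSUV, with hypothetical names and a toy Linear+ReLU model): run a batch through each layer and rescale its weights until the observed output std is about 1.

```python
import torch
from torch import nn

def data_dependent_init(model, xb, tol=0.01, max_iters=10):
    "Rescale each Linear layer from observed batch statistics; purely illustrative."
    h = xb
    for layer in model:
        if isinstance(layer, nn.Linear):
            with torch.no_grad():
                for _ in range(max_iters):
                    std = layer(h).std()
                    if abs(std - 1) < tol:
                        break
                    layer.weight.div_(std)        # scale weights by the observed std
                layer.bias.sub_(layer(h).mean())  # shift the bias so the mean is ~0
        with torch.no_grad():
            h = layer(h)                          # these activations feed the next layer's init
    return model

model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
data_dependent_init(model, torch.randn(64, 784))  # dummy batch just for the sketch
```

The trade-off is that this only fixes the statistics once, at initialization time, which is why the reply below points at BatchNorm.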
This isn't really a "little difference". Activations and initializations are crucial parts of the network, and probably a detail that most relatively new users never adjust from the default values.
Good job!
is it sqrt(5) or sqrt(3)?
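Both numbers show up: PyTorch's reset_parameters for nn.Linear/nn.Conv2d calls kaiming_uniform_ with a=math.sqrt(5), and inside kaiming_uniform_ a sqrt(3) factor converts the computed std into uniform bounds. A quick sketch to compare it against a plain ReLU Kaiming init (layer sizes here are arbitrary):

```python
import math
import torch
from torch.nn import init

x = torch.randn(64, 784)

w_default = torch.empty(50, 784)
init.kaiming_uniform_(w_default, a=math.sqrt(5))       # what nn.Linear does by default

w_kaiming = torch.empty(50, 784)
init.kaiming_uniform_(w_kaiming, nonlinearity='relu')  # Kaiming init for a ReLU layer

print((x @ w_default.t()).std())   # noticeably below 1
print((x @ w_kaiming.t()).std())   # around sqrt(2), sized so the signal keeps roughly unit variance through the ReLU
```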
That's what BatchNorm does.
Link to the Twitter thread that Jeremy had mentioned
BatchNorm isn't about init, though; it's applied during net operation, right?
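Right, they attack the same symptom at different points: init fixes the weight scale once, before training, while BatchNorm renormalizes the activations on every forward pass and learns a scale and shift. A minimal sketch of the training-time computation (running statistics for inference are left out):

```python
import torch
from torch import nn

class MinimalBatchNorm(nn.Module):
    "Training-time batchnorm over the batch dimension (no running stats)."
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(nf))   # learnable scale
        self.beta  = nn.Parameter(torch.zeros(nf))  # learnable shift

    def forward(self, x):
        mean = x.mean(dim=0)                          # per-feature batch mean
        var  = x.var(dim=0, unbiased=False)           # per-feature batch variance
        x_hat = (x - mean) / (var + self.eps).sqrt()  # normalize on every forward pass
        return self.gamma * x_hat + self.beta

x = torch.randn(64, 50) * 10 + 3                      # badly scaled activations
print(MinimalBatchNorm(50)(x).std(dim=0).mean())      # ~1 after normalization (gamma starts at 1)
```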