How does SGD lose control, so that derivatives and errors explode?
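A minimal sketch of one way this happens, assuming a 1-parameter least-squares model and a learning rate chosen too large for the curvature: each SGD step overshoots the minimum by more than it corrects, so the weight, the gradient, and the loss all grow at every step.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)   # optimum is w* = 1
x, y = torch.tensor([3.0]), torch.tensor([3.0])
lr = 0.5                                      # stable only below 2/(2*x^2) ~ 0.11 here

for step in range(5):
    loss = ((w * x - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad                      # plain SGD step: overshoots every time
    w.grad.zero_()
    print(f"step {step}: w = {w.item():+.1f}, loss = {loss.item():.1f}")
```

Each update maps w to -8w + 9, so the distance to the optimum is multiplied by 8 per step: the divergence is geometric, not gradual.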
A 2-linear-layer model with no non-linearity in between is mathematically just another linear model. Why, then, is the 2-linear-layer model much worse than a 1-linear-layer model in experiments?
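A minimal sketch (PyTorch assumed, layer sizes arbitrary) of the first claim: two stacked linear layers with no activation in between compute exactly the same function as one linear layer whose weight is the product `W2 @ W1`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
l1 = nn.Linear(4, 8, bias=False)
l2 = nn.Linear(8, 3, bias=False)
x = torch.randn(5, 4)

stacked = l2(l1(x))                             # two layers, no activation

merged = nn.Linear(4, 3, bias=False)
with torch.no_grad():
    merged.weight.copy_(l2.weight @ l1.weight)  # collapse into a single layer

print(torch.allclose(stacked, merged(x), atol=1e-6))  # True
```

Since the two models are equally expressive, any gap observed in experiments has to come from optimization: SGD now moves through the product parameterization `W2 @ W1`, which changes the gradients and the loss landscape even though the function class is identical.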
What happens when you add a ReLU to the 1st neuron of a 2-neuron model? (Train it freely: 3 of the weights can get stuck, with derivatives that stay zero. Does a 1-neuron model do the same?)
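A minimal sketch (PyTorch assumed; toy 2-neuron model with parameters `w1, b1, w2, b2`) of the dead-ReLU effect the question points at: once the first neuron's pre-activation is negative, ReLU outputs 0 and its derivative is 0, so `w1`, `b1`, and `w2` all receive zero gradient. Those three weights are frozen exactly as if they were fixed, and only the output bias keeps training.

```python
import torch

# 2-neuron model: h = relu(w1*x + b1), yhat = w2*h + b2
w1 = torch.tensor([-5.0], requires_grad=True)  # makes the pre-activation negative
b1 = torch.tensor([-1.0], requires_grad=True)
w2 = torch.tensor([1.0], requires_grad=True)
b2 = torch.tensor([0.0], requires_grad=True)

x, y = torch.tensor([2.0]), torch.tensor([1.0])

h = torch.relu(w1 * x + b1)            # relu(-11) = 0: the neuron is "dead"
loss = ((w2 * h + b2 - y) ** 2).mean()
loss.backward()

# ReLU' = 0 on the negative side, so no gradient reaches w1 or b1,
# and h = 0 kills w2's gradient as well: three weights are frozen.
print(w1.grad, b1.grad, w2.grad)       # all tensor([0.])
print(b2.grad)                         # nonzero: only the output bias still learns
```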
What does momentum look like, and what is the intuition behind it?
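A minimal sketch of the classical momentum update (toy quadratic loss assumed): the velocity `v` is an exponentially decaying running sum of past gradients, so steps in a consistent direction build up speed like a heavy ball rolling downhill, while directions that flip sign from step to step largely cancel out.

```python
import torch

w = torch.tensor([4.0], requires_grad=True)
v = torch.zeros(1)                            # velocity starts at rest
lr, mu = 0.1, 0.9                             # step size, momentum coefficient

for step in range(10):
    loss = (w ** 2).mean()                    # toy bowl with its minimum at w = 0
    loss.backward()
    with torch.no_grad():
        v = mu * v + w.grad                   # decaying sum of past gradients
        w -= lr * v                           # move along the smoothed direction
    w.grad.zero_()
    print(f"step {step}: v = {v.item():+.2f}, w = {w.item():+.2f}")
```

With `mu = 0` this reduces to plain SGD; with `mu` close to 1, the running sum smooths the gradient over roughly 1/(1-mu) past steps, which is where the acceleration (and the occasional overshoot) comes from.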