As @Moody pointed out in previous posts, Jeremy also says in the fastbook that stacking 2 neurons (i.e., two linear functions) without a non-linear activation in between "will just be the same as one linear" model.
However, in the Excel experiment I shared in previous posts, a model made of two linear functions stacked on each other trains much worse than (not the same as) a single linear model. How should I understand this? Is there something wrong with my experiment? @jeremy
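Just to make the algebra concrete (this is my own illustration, not from the book or my workbook): composing two linear layers with no activation in between is still a single linear function, with slope `c*a` and intercept `c*b + d`:

```python
# Minimal sketch: two stacked linear "neurons" collapse into one linear function.
#   layer 1: h = a*x + b
#   layer 2: y = c*h + d = (c*a)*x + (c*b + d)
a, b = 3.0, 5.0   # weights of the first linear neuron (arbitrary example values)
c, d = 2.0, -1.0  # weights of the second linear neuron

def stacked(x):
    h = a * x + b          # first linear layer
    return c * h + d       # second linear layer, no activation in between

def collapsed(x):
    return (c * a) * x + (c * b + d)   # the equivalent single linear layer

for x in [-2.0, 0.0, 1.5, 10.0]:
    assert abs(stacked(x) - collapsed(x)) < 1e-9

print("equivalent single layer: slope =", c * a, ", intercept =", c * b + d)
```

So in terms of what the model *can represent*, the two should indeed be the same; my question is about why they behave so differently when trained.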
Let me briefly describe the experiment: based on Jeremy's 1-neuron model (a 1-linear-layer model) with 2 weights `a` and `b` trying to fit `y = 2x + 30`, I built a 2-neuron model (a 2-linear-layer model) with 4 weights `a`, `b`, `c` and `d` to do the same. Both models share the exact same dataset, the same learning rate, and the same initial weights (all set to 1). Both use the numerical derivative formula to calculate derivatives. You can check my numerical derivative formula here. You can run the experiment on the worksheet "basic SGD 2 neuron collapse" from this workbook.