Live coding 16

As @Moody pointed out in previous posts, Jeremy also says in fastbook that stacking 2 neurons (i.e., linear functions) without a non-linear activation in between "will just be the same as one linear" model.
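
For reference, here is the algebra behind that claim: composing two linear functions always collapses into another linear function,

$$f_2(f_1(x)) = c\,(ax + b) + d = (ca)\,x + (cb + d),$$

i.e., a single linear model with slope $ca$ and intercept $cb + d$. So the two architectures can represent exactly the same set of functions.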

However, as my Excel experiment in previous posts showed, two linear functions stacked on each other produce a model that trains much worse than (not the same as) a single linear model. How should I understand this? Is something wrong with my experiment? @jeremy

Let me briefly describe the experiment: based on Jeremy's 1-neuron model (a 1-linear-layer model) with 2 weights a and b trying to fit y = 2x + 30, I built a 2-neuron model (a 2-linear-layer model) with 4 weights a, b, c, and d to do the same. Both models share the exact same dataset, the same learning rate, and the same initial weights (all set to 1), and both compute gradients with the numerical derivative formula. You can check my numerical derivative formula here, and you can run the experiment on the worksheet "basic SGD 2 neuron collapse" from this workbook.
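
For anyone who wants to try the same setup outside Excel, here is a minimal Python sketch. The dataset, learning rate, step count, and epsilon below are my assumptions for illustration, not the values from the workbook, and the central-difference formula may differ slightly from the one in my linked sheet:

```python
# Minimal sketch of the two models, trained with SGD and
# finite-difference gradients.  Dataset, lr, steps, and eps are
# ASSUMPTIONS for illustration, not the workbook's actual values.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, 50)        # assumed: 50 random inputs
y = 2 * x + 30                      # target function: y = 2x + 30

def mse(pred):
    return ((pred - y) ** 2).mean()

def loss1(w):                       # 1-neuron model: a*x + b
    a, b = w
    return mse(a * x + b)

def loss2(w):                       # 2-neuron model: c*(a*x + b) + d
    a, b, c, d = w
    return mse(c * (a * x + b) + d)

def num_grad(loss, w, eps=1e-5):    # central-difference derivative
    g = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (loss(wp) - loss(wm)) / (2 * eps)
    return g

def sgd(loss, w, lr=1e-3, steps=5000):
    for _ in range(steps):
        w = w - lr * num_grad(loss, w)
    return w, loss(w)

w1, l1 = sgd(loss1, np.ones(2))     # all weights start at 1, as in the sheet
w2, l2 = sgd(loss2, np.ones(4))
print(f"1 neuron : loss {l1:.4f}, weights {w1}")
print(f"2 neurons: loss {l2:.4f}, weights {w2}")
```

With the same learning rate and step budget, the stacked version may well converge more slowly, which would match what the spreadsheet shows; but since the hyperparameters above are guesses, treat this as an illustration of the setup rather than a reproduction of the workbook's numbers.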
