The difference is that I set the x values from -80 to 120 with a step of 10, so I can draw a line chart easily.
You used random numbers to generate x, but Excel treats each row as a data point, so the chart may not be a true reflection of the data. Check your "Horizontal (Category) Axis Labels" setting: it should link to the cells containing the x values (not hard-coded values generated by Excel). I pushed my file to your repo so you can inspect it.
Now I have a clearer view of what I was trying to experiment with.
First of all, Jeremy used graddesc.xlsx file to demonstrate the following things for us:
how weights are updated using SGD
how better techniques like Momentum, Adam, etc. can speed up training with SGD
all these techniques can be demonstrated with a linear model of just 2 weights
To better appreciate Jeremy’s teaching above, I want to do the same with a slightly more complex model. To be more specific, I want to train a 2-neuron (4-weight) model, i.e. a model of two linear functions (one on top of the other), to find the simple linear function y = 2x + 30. In Jeremy’s spreadsheet, he used a linear model y = ax + b to find y = 2x + 30, so my model is slightly more complex.
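For concreteness, here is a minimal Python sketch of the two models being compared (the function names and the NumPy translation are my own; the actual experiment lives in the spreadsheet):

```python
import numpy as np

def model_1neuron(x, a, b):
    # Jeremy's 1-neuron model: a single linear function
    return a * x + b

def model_2neuron(x, a, b, c, d, relu=False):
    # my 2-neuron model: one linear function stacked on another,
    # optionally with a ReLU between them
    h = a * x + b
    if relu:
        h = np.maximum(h, 0)
    return c * h + d

x = np.arange(-80, 121, 10)   # same x range as the spreadsheet
y_true = 2 * x + 30           # the target function y = 2x + 30
```

Both models are then trained with SGD to make their output match `y_true`.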
Is my model smarter and faster than a simple linear model?
At first, I expected my model to be smarter and maybe even faster, as it is slightly more complex. However, its error exploded without finishing even one epoch of training, let alone matching the 1-neuron model by Jeremy, which it is so much worse than.
Here, I’d like to share some interesting findings. The updated spreadsheet can be downloaded here.
When training to find a simple linear function y = 2x + 30:
A 2-neuron model without an activation (ReLU) does not simply collapse into a single neuron in practice: it can hardly train, and when it does finish an epoch, its error is much, much worse than the single-neuron model’s.
A 2-neuron model with ReLU can train freely, so ReLU is what makes multi-neuron models work.
But this 2-neuron model with ReLU still trains much slower than the 1-neuron model.
These findings are interesting because I can’t yet explain them. To find out why, I plan to visualize the experiment data to get a feel for how the error explodes, and to read fastbook chapter 4 and other related chapters.
Any advice on how I should explore and experiment with the questions above is very, very welcome.
Hi Sarada, thanks a lot for the Excel chart tips. They are very helpful, and now I can create graphs similar to yours.
However, using the dataset in the Excel file you made for me, I still can’t reproduce your exact graph. (I tried to adjust the y-axis range to be exactly like yours. I also inspected the data columns you used for plotting; everything is the same, and I made the two graphs on the same worksheet.)
But as you can see in the image below, the data plotted on your graph doesn’t seem right (there are no data points greater than 1000 or less than -400), yet when I inspect the dataset behind your graph, all the selected data are the same as mine. I simply can’t reproduce your graph.
Why do errors, weights and derivatives all explode? Given that SGD uses derivatives to tell the model which direction the weights should go and by how much, why would weights and errors go crazy?
Yes, the derivative of the error with respect to a weight does tell us which direction to move in order to decrease the error, and also how much lower the error would go if the weight went up by 0.01. However, the derivative does not tell you how far your weight is from the optimal value at which the error is minimal.
So, when calculating how big a step a weight should take, SGD says that besides using the derivative you should add a knob (the learning rate) to scale the step manually, which is very clever.
Your model won’t train without an appropriate learning rate
Setting an appropriate learning rate is very important to ensure training starts without exploding, because you can imagine the first step of a weight being so large that the weight can’t move closer to the optimal value and the error can’t go down. See what happens to the model when the learning rate is changed from 0.0001 to 0.01 below.
So, having an appropriate learning rate, at least for the starting section of the dataset, is crucial to keep training going, meaning the weights move in good steps toward their optimal values and the error keeps decreasing.
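The same contrast between 0.0001 and 0.01 can be reproduced outside Excel. Here is a small sketch of the 1-neuron model under the two learning rates; note it is a simplification of the spreadsheet (full-batch updates and analytic derivatives instead of per-row updates and numerical derivatives), but the converge-vs-explode behaviour is the same in spirit:

```python
import numpy as np

x = np.arange(-80, 121, 10, dtype=float)   # same x range as the spreadsheet
y = 2 * x + 30                             # target: y = 2x + 30

def train(lr, steps=100):
    """Full-batch gradient descent on the 1-neuron model y = a*x + b."""
    a, b = 1.0, 1.0                        # same initial weights as the experiment
    for _ in range(steps):
        diff = a * x + b - y
        grad_a = np.mean(2 * diff * x)     # derivative of mean squared error w.r.t. a
        grad_b = np.mean(2 * diff)         # derivative w.r.t. b
        a -= lr * grad_a                   # the "knob": step = lr * derivative
        b -= lr * grad_b
    return np.mean((a * x + b - y) ** 2)   # final mean squared error

loss_small = train(lr=0.0001)              # appropriate lr: error shrinks
loss_large = train(lr=0.01, steps=10)      # too large: error explodes fast
```

With `lr=0.0001` the error drops well below its starting value; with `lr=0.01` each step overshoots the optimum by more than the last, and the error blows up within a handful of steps.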
The derivatives of changing weights seem unpredictable, so how does SGD use the learning rate to move weights toward the optimum in most cases?
However, one shoe size can’t fit all feet: you can’t guarantee the next derivative is always smaller than the previous one. In fact, sometimes the derivatives grow much larger than the previous ones. As long as the derivatives are not too large and the step size is still appropriate, then, given the correct direction provided by the derivatives, the weights can still move toward the optimum, however fast or slow.
How do derivatives and errors explode under SGD?
However, in some cases a derivative can be so large that the learning rate is no longer appropriate, and the weight takes a big step away from the optimum even though the direction provided by the derivative is correct. The derivative at the new point may well be even larger, taking the weight even further away. It becomes a self-reinforcing loop, moving further and further from the optimum, with no chance of bringing things back down. So the derivatives and the error all explode. (See the graph above.)
This is today’s investigation and speculation. If you find anything suspicious or wrong, please share it with me, thanks!
Hi @jeremy, if I keep posting about fastai and excel in this thread (Live coding 16), will you eventually split it into a different thread (e.g., exploring fastai with excel)?
If so, will the links to the split posts change? Since I am documenting every knowledge point (using post links) I have learnt with @Moody and others on this topic, I care very much about whether the links will become unusable after the split.
If they would change, I hope you (assuming you are the only one who can do it) don’t split the relevant part of this thread, so that the links stay unchanged. Maybe I could instead start a new thread for exploring fastai with Excel? What’s your suggestion on this?
Thanks a lot, Sarada, for the guidance and support! I would surely love to explore the resources you pointed out some time later.
To be honest, I have just started to feel a little more comfortable working with Excel on my 2-neuron model, and I would like to stay in this comfort zone a little longer. I would like to thoroughly go through all of Jeremy’s spreadsheets first, since they are directly related to the part 1 lectures. What do you think? To me, they are all exciting new worlds to be discovered and experimented with.
Yes, so can we propose that the non-linear activation is the key to turning ordinary linear layers/neurons into a magical neural net? Is it true? I think the Excel experiment below provides some support for this proposal. @Moody
The observations from the experiment:
after adding a ReLU to the first neuron, the 2-neuron model can train freely without the error exploding (it must be the magic of ReLU, right?)
within the first epoch, 3 out of 4 weights found their optimal values and stopped updating themselves (whereas both weights of the 1-neuron model are still updating without settling; it must be the magic of ReLU, right?)
however, although it improves steadily, the error is far worse than the 1-neuron model’s (interesting!)
As @Moody pointed out in previous posts, Jeremy also says in the fastbook that stacking 2 neurons (or linear functions) without a non-linear activation in between “will just be the same as one linear” model.
However, as my Excel experiment in previous posts showed, a model of 2 linear functions stacked on each other performs much worse than (not the same as) a single linear model. How should I understand this? Is something wrong with my experiment? @jeremy
Let me briefly describe the experiment: based on Jeremy’s 1-neuron model (1 linear layer) with 2 weights a and b trying to find y = 2x + 30, I built a 2-neuron model (2 linear layers) with 4 weights a, b, c and d to do the same. Both models share the exact same dataset, learning rate and initial weights (all set to 1). Both use the numerical derivative formula to calculate derivatives; you can check my numerical derivative formula here. You can run the experiment on the worksheet “basic SGD 2 neuron collapse” in this workbook.
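For readers who haven’t opened the workbook, a numerical derivative of this kind can be sketched as a finite difference. (The forward difference and the 0.01 nudge below are my assumptions, matching the “weight goes up by 0.01” wording earlier in the thread; the spreadsheet formula may differ in detail.)

```python
def numerical_derivative(f, w, eps=0.01):
    # forward finite difference: nudge the weight a little and
    # measure how much the error changes per unit of nudge
    return (f(w + eps) - f(w)) / eps

# error of the 1-neuron model at a single data point x=10, b=1
# (target value y = 2*10 + 30 = 50)
error_wrt_a = lambda a: (a * 10 + 1 - 50) ** 2

approx = numerical_derivative(error_wrt_a, 1.0)
# close to the analytic derivative 2*(10a - 39)'... = -780 at a=1
```

This is the same idea the spreadsheet implements cell by cell: no calculus needed, just “nudge the weight and watch the error”.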
The main issue is that gradient descent doesn’t necessarily find the best answer. In high dimensions (e.g. big neural nets) it generally gets pretty close, if you’ve got a good learning rate. But in low dimensions, it often doesn’t.
So experiments like this won’t tell you much about performance on real models, unfortunately.
My understanding of what you said is: in low dimensions, i.e. among narrow and shallow neural nets, SGD does not always find the best answer for the slightly bigger model; but in high dimensions, i.e. among large neural nets, SGD with a good learning rate can generally get large models closer to the optimum than smaller ones. (This makes a lot of sense, and I’d like to take it to be true.)
And I assume that when you mention ‘high and low dimensions’, you only refer to neural nets with nonlinearity (there is no such thing as a neural net without nonlinearity, right?), correct?
Because in the book and in the videos you say that multiple linear functions on top of each other are just another single linear function with different weights, so there is no high/low-dimension difference between a single linear model and a model of multiple linear functions stacked without nonlinearity.
S: Mathematically, we say the composition of two linear functions is another linear function. So, we can stack as many linear classifiers as we want on top of each other, and without nonlinear functions between them, it will just be the same as one linear classifier.
An intuitive expectation would be that the mathematical equivalence between a single linear model and a model of multiple linear functions stacked on top of each other should imply the same, or very similar, model performance.
To be more specific:
In the book and lectures you said that stacking multiple linear functions together is really just “a single linear layer with a different set of parameters”; e.g. a 2-linear-function model can be collapsed into y = (ac)x + (cb + d). So we would expect this stacked model to perform the same as, or very similarly to, our y = ax + b, if not better. But the experiment shows the stacked model cannot even finish training a single epoch, let alone its terrible error.
So, does this suggest that from the perspective of model performance the two models are not the same, even though mathematically they are?
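One thing we can at least verify numerically is the algebraic claim itself: without a ReLU, the stacked function c*(a*x + b) + d and the collapsed linear function (ac)x + (cb + d) compute identical outputs (the weight values below are arbitrary examples of mine):

```python
import random

a, b, c, d = 1.5, -2.0, 0.7, 3.0           # arbitrary example weights

for _ in range(100):
    x = random.uniform(-80, 120)
    stacked = c * (a * x + b) + d          # two linear layers, no ReLU
    collapsed = (a * c) * x + (c * b + d)  # one equivalent linear layer
    assert abs(stacked - collapsed) < 1e-9
```

So the two models are the same *function*; what differs is the loss surface SGD walks on. With 4 weights the gradients are coupled (the gradient for a is scaled by c, and vice versa), which is presumably why training behaves so differently even though the functions are mathematically equal.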
Basically, you can apply momentum to any wiggly line. In the first example below we use momentum to smooth out y = sin(x). It smooths the line by giving a large weight (in deep learning the convention is 90%) to the previous momentum value and a 10% weight to sin(x), so the momentum formula is mom_t = 0.9 * mom_(t-1) + 0.1 * sin(x_t).
We use SGD to update weights using derivatives. Derivatives tend to swing between large positive and negative values, and as a result the weight values swing too.
I suspect the intuition behind momentum is this:
Weights jumping back and forth in big steps won’t bring them to the optimum efficiently. But if weights keep the momentum of their movement and only increase their steps gradually, there should be a higher chance of reaching the optimum.
First Example: Let’s visualize the momentum function
In this experiment we apply the momentum formula to a sine function and plot the result to see how it looks. Note: I set the initial momentum value to 0.
Can you get a feel from the graph below for how momentum smooths things out, or keeps the momentum of the previous state?
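For anyone who prefers code to a spreadsheet, here is the same first example in Python (the step size of 0.1 along x is my own choice):

```python
import math

beta = 0.9                          # 90% weight on the previous momentum value
xs = [i * 0.1 for i in range(200)]
mom = 0.0                           # the first momentum value is set to 0
smoothed = []
for x in xs:
    mom = beta * mom + (1 - beta) * math.sin(x)
    smoothed.append(mom)
# the smoothed curve swings noticeably less than sin(x) and lags behind it
```

Plotting `smoothed` against `xs` gives the same picture as the spreadsheet: a sine-like curve with a smaller amplitude, trailing slightly behind the raw sin(x).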
Second Example: Let’s visualize momentum when applied on derivatives of weight
The derivatives of the weights at each data point can be very wiggly. I have plotted the derivatives of weight d from our 2-neuron models, whose only difference is whether they use ReLU and momentum: “no ReLU” is the derivative in the model without ReLU, “ReLU” is the derivative in the model with ReLU, “under mom” is the derivative in the model using ReLU and momentum, and “mom” is the momentum value in that same model.
Jeremy explained the use of momentum in two cases, which I only got my head around when doing this experiment today. (Please correct me if my understanding is still not accurate.)
Use case 1: when derivatives are not jumping back and forth in big steps
Momentum: remember where you were and don’t jump with the full change; increase the change (step size) only gradually, so that you explore the space more finely and have a lower probability of stepping over the optimum (see the graph from the experiment). When your growing steps meet a large derivative pointing in the opposite direction, you are forced to turn around, again increasing your steps only gradually. (Also see Jeremy’s drawing below.)
Use case 2: when derivatives are jumping back and forth in large steps
When derivatives jump back and forth in large steps, the search for the optimum is not very efficient. But when they do jump like that, momentum keeps the jumps smaller, so you end up closer to the optimum than you would without momentum. (See the graph illustration below.)
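A tiny numerical check of use case 2 (the alternating ±1 derivatives are my own toy example): with the 90/10 momentum formula, steps that would flip between +1 and -1 settle down to an amplitude of (1-beta)/(1+beta), about 0.05.

```python
beta = 0.9
mom = 0.0
steps = []
for i in range(100):
    g = 1.0 if i % 2 == 0 else -1.0   # derivative flips sign every step
    mom = beta * mom + (1 - beta) * g
    steps.append(mom)
# raw steps have size 1; momentum steps shrink toward
# (1 - beta) / (1 + beta) = 0.1 / 1.9, roughly +/- 0.053
```

So instead of bouncing a full unit left and right past the optimum, the momentum-smoothed steps oscillate in a band about twenty times narrower.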
There is something confusing about what Jeremy said on use case 1:
“If you are here and your learning rate is too small, if you keep doing the same steps, then if you also add in the step you took last time, and then your steps are getting bigger and bigger aren’t they? Until eventually they go too far …”
This quote sounds as if “the same steps” are too small, but after momentum is applied they become bigger and bigger. That interpretation can’t be right, though: according to our experiment below, if the same steps are always 1, and therefore greater than the initial momentum value of 0, then the growing momentum steps can never become bigger than 1. And if the steps suddenly change to 0.5 and stay there, the step value produced by momentum will decrease but always remain above 0.5. (If I am wrong here, could anyone help me get this straight? Thanks!)
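The arithmetic above is correct for the 90/10 *averaged* form of momentum. One possible resolution of the confusion, which I offer as a guess: there is also a classic *non-averaged* form, mom = beta * mom + grad (as I understand it, this is what PyTorch’s SGD uses when dampening is 0), and under that form repeated identical gradients really do produce steps that grow “bigger and bigger”, toward grad / (1 - beta) = 10x the raw step. A sketch checking all three claims:

```python
beta = 0.9

# averaged (90/10) form: mom = beta*mom + (1-beta)*grad
mom = 0.0
phase1 = []
for _ in range(50):                 # "the same steps" are always 1
    mom = beta * mom + (1 - beta) * 1.0
    phase1.append(mom)
phase2 = []
for _ in range(50):                 # then they suddenly drop to 0.5 and stay
    mom = beta * mom + (1 - beta) * 0.5
    phase2.append(mom)

# classic (non-averaged) form: mom = beta*mom + grad
classic_mom = 0.0
classic = []
for _ in range(50):
    classic_mom = beta * classic_mom + 1.0
    classic.append(classic_mom)
```

With the averaged form, `phase1` rises toward 1 but never exceeds it, and `phase2` decays toward 0.5 while staying above it, exactly as reasoned above; with the classic form, `classic` grows toward 1/(1-beta) = 10, which would match the “bigger and bigger” in the quote.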
While walking through the momentum code in the fastai source, I could not find a test for average_grad in the fastai folder, but I did find one in the docs. Is there a test for average_grad in the fastai source?
If not, does that mean average_grad doesn’t get tested every time the library is updated?
Well, a sleep certainly helps: I found it in nbs/12_optimizer.ipynb. I guess the problem is that ctags doesn’t index average_grad in the nbs folder, and GitHub doesn’t search inside Jupyter notebook files.