Lesson 3 official topic

thanks!

Hi Everyone!

After watching the lesson 3 video, I am stuck on chapter 4 of the notebook, and I was hoping someone could help me out!

I am doing the “End to End SGD Example” part of chapter 4.

Can someone tell me why the ‘speed’ variable has been assigned with the following formula: “torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1”? Why are we using this specific formula in this specific manner? Is this formula supposed to be similar to a quadratic equation? And why are we subtracting 9.5 from the ‘time’ variable?
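For reference, the cell in question in chapter 4 of the fastbook notebook looks roughly like this (reconstructed from memory, so minor details may differ):

```python
import torch
import matplotlib.pyplot as plt

# 20 time steps: 0, 1, ..., 19
time = torch.arange(0, 20).float()

# speed = a parabola in `time` plus Gaussian noise:
#   torch.randn(20)*3          -> random noise, scaled by 3
#   0.75*(time - 9.5)**2 + 1   -> a quadratic whose vertex sits at time = 9.5,
#                                 the midpoint of the 0..19 range
speed = torch.randn(20) * 3 + 0.75 * (time - 9.5) ** 2 + 1

plt.scatter(time, speed)
plt.show()
```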

I have attached a relevant screenshot as well.

I am trying to use a Gradient Notebook on Paperspace.
This is my first time; I simply created an account (and didn’t link a credit card).
But whenever I select Paperspace + Fast.AI, I don’t see any available machines, not even the free one.
Do I need to add a payment method to my account to use the free machine?

@jeremy should the HuggingFace Spaces Pets repository listed here be in the resources for Lesson 2 instead? Lesson 2 (and chapter 2 of the book) seems more related to pet classification than Lesson 3.

Check out https://www.learnpytorch.io and the official Get Started page on the PyTorch site.

Google Sheets solution - lesson 3.

Results
The solver plugin required to get this working in Google Sheets is a bit limited; however, you’ll see it was able to make accurate predictions (a loss of 0.00) for a number of passengers after a few iterations.

This is an aspect that I was also confused by.

Take the example given in lesson 3 of finding the quadratic ax² + bx + c that best fits the “noisy” dataset.

I’ve been thinking of it like this.

  1. We have a noisy plot; this is our “dataset”.
  2. We plot an arbitrary quadratic, with initial values for params a, b, c, on the same graph, calling this the prediction:
    y_pred = ax² + bx + c
  3. For each point in our dataset we calculate the “loss”, i.e. the distance from this point to the corresponding point on the quadratic we just drew (corresponding here means the points we’re measuring the distance between have the same x value; another name for the x variable here is “the independent variable”).
    loss = y_pred - y_true
  4. We then find the mean of the squares of all these losses:
    MSE = 1/n * Σ (y_pred - y_true)²

Expanding out the y_pred - y_true part of this (i.e. the loss e, without the mean and square parts), we can see our parameters within this loss function like this:

e = y_pred - y_true = ax² + bx + c - y_true

Our MSE can now be written as:

MSE = 1/n * Σ e²
  5. So at this point we have a mean which involves all our parameters (a, b and c), but we want to see how this loss changes with respect to just one of these parameters, let’s say b. As b changes, the loss will change, so I’m guessing that under the hood we do this for a few values of b (would love to confirm this). This gives us datapoints to draw another plot, this time with b on the x-axis and MSE on the y-axis. The gradient of this curve at the current value of param b is the gradient we want to calculate. Which we do, as you mention, using the gradient of the tangent to the curve at that point, aka “the derivative”, ∂MSE/∂b.

  6. If this gradient ∂MSE/∂b, which is called the derivative of MSE with respect to b, is positive we know we want to decrease b, and vice versa, before the next iteration of training.
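
To make this concrete, here is a minimal PyTorch sketch of steps 1-6 (my own illustration, not code from the lesson, and the names are made up). One thing worth noting on step 5: rather than evaluating the loss at several values of b, PyTorch’s autograd computes ∂MSE/∂a, ∂MSE/∂b and ∂MSE/∂c analytically in a single backward pass:

```python
import torch

# Step 1: a noisy "dataset" (a quadratic plus noise)
x = torch.linspace(-2, 2, steps=20)
y_true = 3 * x**2 - 2 * x + 1 + torch.randn(20) * 0.3

# Step 2: arbitrary initial values for params a, b, c
params = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)

lr = 0.01
for step in range(1000):
    a, b, c = params
    y_pred = a * x**2 + b * x + c           # step 2: the prediction
    loss = ((y_pred - y_true) ** 2).mean()  # steps 3-4: MSE
    loss.backward()                         # step 5: ∂MSE/∂a, ∂MSE/∂b, ∂MSE/∂c
    with torch.no_grad():
        params -= lr * params.grad          # step 6: positive gradient -> decrease the param
        params.grad.zero_()

print(params)  # should drift toward (3, -2, 1)
```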

Hope I am correct in all that and it makes sense. I realize there are some chain-rule derivations that can be done on MSE = 1/n * Σ e² to express all this mathematically, but I am not comfortable with the chain rule right now.

That reasoning makes sense to me. I view the MSE as a function defined over the a, b, c axes, and the gradients of MSE with respect to each variable a, b, c are used to move in the direction of the minimum of the function.

I had one question about the process of using gradients to reduce the MSE (cost). If we find the gradient and adjust the value of the variable to make the MSE decrease, isn’t it possible that we move our solution in a direction that is not towards the true minimum? For example, in the image below, the gradient from the current location leads to a lower MSE, but this movement is in the opposite direction from the true minimum of the MSE.

Am I missing something in my understanding here?

That is an actual issue; I’m not too aware of the methods that can overcome it, though I assume the fastai Learner class has optimizers with algorithms built in to address it.

But this issue is why there can be a lot of variation when training a model: sometimes you begin at a more favorable location in the optimization landscape, and other times you don’t. Luck can play a factor.


Some extra information you may be interested in.

The method you describe is known as “Greedy Search”. It optimizes only for short-term gain: it rejects any negative change and accepts only positive changes, so it’s very easy to get stuck in a local optimum with this method.

One method of overcoming this is known as “Simulated Annealing”. You have a value known as the temperature. The higher the temperature, the more likely it is that a negative change will be accepted. During the early stages of training the temperature is very high, meaning both negative and positive changes are likely to be accepted. As training progresses, the temperature gradually decreases, and so the chance of accepting a negative change decreases.

The probability that a negative change is accepted is given by the following formula, taking S to be the loss/cost we are minimizing:

e^{-\frac{S_{\text{new}} - S_{\text{old}}}{T}}

S_{\text{old}} is the old loss/cost, S_{\text{new}} is the new loss/cost, and T is the temperature. A worse (higher) new loss makes the exponent negative, so the acceptance probability falls below 1, and it falls further as T decreases.
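
As a rough illustration (my own sketch, not tied to any particular library), the acceptance rule could be implemented like this:

```python
import math
import random

def accept(s_old, s_new, temperature):
    """Simulated-annealing acceptance rule for a loss being minimized."""
    if s_new <= s_old:
        return True  # improvements are always accepted
    # Worse moves are accepted with probability e^(-(s_new - s_old) / T);
    # high temperature -> probability near 1, low temperature -> near 0.
    return random.random() < math.exp(-(s_new - s_old) / temperature)

# Early in training (high T) a worse move is often accepted:
print(accept(s_old=1.0, s_new=1.2, temperature=2.0))
# Late in training (low T) it almost never is:
print(accept(s_old=1.0, s_new=1.2, temperature=0.01))
```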

Thanks! That makes sense. The simulated annealing idea is also very interesting, and it seems a little similar to the decaying learning rate discussed in the lecture: both restrict movement as we get closer to a solution.

Yes, I suppose that is one way to create a decaying learning rate — haven’t really thought of it like that heh.

Suhaas, I had the same question, this is what I figured out.

If you use Stochastic Gradient Descent instead of vanilla Gradient Descent (which calculates the loss using the entire dataset for a given value of the independent variable), SGD uses random subsets (mini-batches) of the data to calculate the loss, and this randomness can help jump out of shallow local minima (see the sketch after the list below).

  • These are called noise-induced jumps. Using mini-batches causes “noise” or “vibration” in the gradient: as we change the independent variable, the gradient will be inconsistent, and may on occasion reverse direction, which can result in us escaping the local minimum.
    • Probably not a very helpful analogy, but: imagine a ball bearing rolling around your diagram; plucking the plot like a guitar string would help it jump out of a local minimum, and it’d eventually find the global minimum.
  • Mini-batches also mean the parameters are updated more frequently: basically, I think it means we can calculate the loss multiple times per pass over the data, taking an ‘average’ gradient or similar.
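
Here is a tiny sketch of what that mini-batch noise looks like in PyTorch (my own illustration; the dataset and names are made up): each randomly sampled batch gives a slightly different gradient estimate, which is the “vibration” described above.

```python
import torch

# Full dataset: a noisy quadratic, as earlier in the thread
x = torch.linspace(-2, 2, steps=200)
y = 3 * x**2 - 2 * x + 1 + torch.randn(200) * 0.3

params = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
lr, batch_size = 0.01, 16

for step in range(500):
    # Sample a random mini-batch; each batch yields a slightly
    # different (noisy) estimate of the full-batch gradient.
    idx = torch.randint(0, len(x), (batch_size,))
    xb, yb = x[idx], y[idx]
    a, b, c = params
    loss = ((a * xb**2 + b * xb + c - yb) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad
        params.grad.zero_()

print(params)  # noisy updates, but they should drift toward (3, -2, 1)
```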

That said, SGD isn’t guaranteed to find the global minimum for non-convex loss functions.
But even with this limitation, there are other levers that are probably worth thinking about, particularly as there are a lot of variables at play and the model is working in high dimensions. Off the top of my head, these could help, and are likely already part of the model-development workflow:

  • Using ensembling to mitigate any one model being stuck in a local minimum.
  • Tweaking the learning rate.

Finally, I’ve read that in practice people don’t worry a great deal about this, because a local minimum isn’t strictly a bad thing and often doesn’t result in significantly different accuracy. Basically, there are many equivalent solutions:

In a neural network with many layers and many neurons per layer, there are often many different combinations of parameters that produce equivalent or nearly equivalent results. This means that even if you find a different minimum each time you run the training algorithm, the practical differences between these solutions may be negligible.