When applying SGD optimization, we always decrease the value of the weights, but why?
As far as I know, the meaning of a derivative like dy/db is
“when the value of b changes a little bit (close to 0), how does the function y change?”
But it does not say anything like _“decreasing the value of b (b -= learning_rate * dy/db) will decrease the loss”_.
Besides, there are many weights out there; when using SGD, we just decrease all of the weights together, and the most amazing part is that this simple strategy works.
We don’t decrease the weights, we just change them in the direction opposite to the gradient.
When training using gradient descent, the quantity we are interested in is dLoss / dW. As you know, this quantity gives us the change in the loss as the weights increase slightly.
Let the gradient update be W(i + 1) = W(i) - step_size * dLoss / dW.
Let’s see how this equation automatically ensures that we always update the weights in the right direction.
Assume that W(i) = 5 and step_size = 0.1.
a) dLoss / dW is positive, say 2.
A small positive change in W causes the loss to increase.
We should be making negative changes in W (or decreasing it).
W(i + 1) = 5 - 0.1 * 2 = 4.8, weights decrease.
b) dLoss / dW is negative, say -2.
A small positive change in W causes the loss to decrease.
We should be making positive changes in W (or increasing it).
W(i + 1) = 5 - 0.1 * -2 = 5.2, the weights increase!
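The two cases above can be checked in a few lines of code. This is just a sketch of the explanation; `d_loss` is a stand-in that returns the derivative values used in the example, not anything from a real model:

```python
step_size = 0.1
w = 5.0

# Stand-in for dLoss/dW: returns the derivative value assumed in each case.
def d_loss(case):
    return {"a": 2.0, "b": -2.0}[case]

# Case a) positive gradient: the update decreases the weight.
w_next_a = w - step_size * d_loss("a")  # 5 - 0.1 * 2 = 4.8

# Case b) negative gradient: the update increases the weight.
w_next_b = w - step_size * d_loss("b")  # 5 - 0.1 * -2 = 5.2
```

Either way, the minus sign in the update rule flips the gradient's sign, so the weight always moves in the direction that lowers the loss.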
I have made a number of simplifying assumptions in the above explanation, but the intuition carries over to the case of higher dimensions.
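For the many-weights case raised in the question, the same rule is applied elementwise: each weight moves against its own partial derivative, so some weights decrease while others increase in the same update. A minimal NumPy sketch (the values are illustrative, not from a real model):

```python
import numpy as np

# Two weights with partial derivatives of opposite sign,
# matching cases a) and b) above.
W = np.array([5.0, 5.0])
grad = np.array([2.0, -2.0])  # dLoss/dW for each weight
step_size = 0.1

# Elementwise update: the first weight decreases, the second increases.
W_next = W - step_size * grad  # [4.8, 5.2]
```

So SGD never "decreases all the weights"; it moves each one opposite to its own gradient component, and the sign of that component decides the direction.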
Thanks, your explanation is very intuitive; you cleared the miasma from my mind.