I’ve been playing with gradient descent on a simple linear model, as shown in Lesson 2’s Linear regression notebook. I created a JavaScript version in TensorFlow.js that lets you play with the parameters and watch it train in real time.
What this shows is that the a parameter in y = ax + b is always learned much faster than the b parameter. Why does this happen? Is it normal, or is there a bug in my implementation?
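For reference, here is a stripped-down sketch of the kind of training loop I mean (not my actual notebook; the true parameters, data range, and learning rate are just placeholder assumptions). Even in this toy version, a settles near 3 within a few dozen steps while b creeps toward 2 much more slowly:

```js
// Fit y = a*x + b with plain SGD in TensorFlow.js and watch a and b evolve.
// Assumed placeholders: true parameters (3, 2), inputs in [-5, 5], lr = 0.01.
const tf = require('@tensorflow/tfjs');

const trueA = 3, trueB = 2;
const xs = tf.linspace(-5, 5, 100);          // inputs x_i
const ys = xs.mul(trueA).add(trueB);         // targets y_i = 3*x_i + 2

const a = tf.variable(tf.scalar(Math.random()));
const b = tf.variable(tf.scalar(Math.random()));

const predict = x => a.mul(x).add(b);                          // model y = a*x + b
const loss = (pred, label) => pred.sub(label).square().mean(); // mean squared error

const optimizer = tf.train.sgd(0.01);
for (let step = 0; step <= 200; step++) {
  optimizer.minimize(() => loss(predict(xs), ys));
  if (step % 40 === 0) {
    console.log(`step ${step}: a = ${a.dataSync()[0].toFixed(3)}, b = ${b.dataSync()[0].toFixed(3)}`);
  }
}
```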
One way to see this is to compute the gradient by hand. Start by sampling some set of real numbers x_i. Now let’s generate our “data”: the model’s prediction with the current parameters is X_i = a1*x_i + b1 and the “target” is Y_i = a2*x_i + b2 (a2, b2 being the true parameters), so our loss is
L(a1, b1) = sum_i ((a1 - a2)*x_i + (b1 - b2))^2, so the gradient of the loss w.r.t. a1 is a sum of terms of the form
2 * [(a1 - a2)*x_i + (b1 - b2)] * x_i
whereas the gradient w.r.t. b1 is a sum of terms of the form
2 * [(a1 - a2)*x_i + (b1 - b2)]
So they are basically the same, except that the first weights (a1 - a2) by x_i^2 rather than x_i. Unless all the x_i lie between -1 and 1 (feel free to try that), x_i^2 is much larger than |x_i|, so the gradient w.r.t. a1 typically has a much larger magnitude than the gradient w.r.t. b1, and that parameter updates faster.
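Here is a quick numeric check of that argument in plain JavaScript (the particular parameter values and sampling range are arbitrary assumptions, chosen only so the x_i fall outside [-1, 1]):

```js
// Compare the analytic gradients dL/da1 and dL/db1 for the loss above.
const a1 = 0, b1 = 0;        // current parameters (assumed starting point)
const a2 = 3, b2 = 2;        // "true" parameters used to generate the targets
const xs = Array.from({ length: 100 }, () => (Math.random() - 0.5) * 10); // x_i in [-5, 5]

let gradA = 0, gradB = 0;
for (const x of xs) {
  const residual = (a1 - a2) * x + (b1 - b2); // prediction minus target
  gradA += 2 * residual * x;                  // dL/da1 term, weighted by x_i
  gradB += 2 * residual;                      // dL/db1 term, weighted by 1
}
console.log('|dL/da1| =', Math.abs(gradA).toFixed(1),
            '|dL/db1| =', Math.abs(gradB).toFixed(1));
// With x_i spread over [-5, 5], |dL/da1| comes out roughly an order of
// magnitude larger than |dL/db1|, so a1 moves much faster than b1 at the
// same learning rate. Rescale the x_i into [-1, 1] and the gap shrinks.
```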