Gradient descent on linear model

I’ve been playing with gradient descent on a simple linear model, as shown in Lesson 2’s linear regression notebook. I created a JavaScript version in TensorFlow.js which lets you play with the parameters and watch it train in real time.

But what this shows is that the a parameter in y=ax+b is always learned much faster than the b parameter. Why does this happen? Is it normal or is there a bug in my implementation?

One way to see this is to compute the gradient by hand. Start by sampling some set of real numbers x_i. Now let’s generate our “data”: the model outputs are X_i = a1 x_i + b1 and the “targets” are Y_i = a2 x_i + b2, so our loss, the sum of the squared differences (X_i - Y_i)^2, is

L(a1,b1) = sum_i ((a1-a2) x_i + (b1-b2))^2

The gradient of the loss w.r.t. a1 is a sum of terms that look like

2 ((a1-a2) x_i^2 + (b1-b2) x_i)

whereas the gradient w.r.t. b1 is a sum of terms that look like

2 ((a1-a2) x_i + (b1-b2))

So they are basically the same, except that every term in the first carries an extra factor of x_i: (a1-a2) is weighted by x_i^2 instead of x_i, and (b1-b2) by x_i instead of 1. Unless all the x_i are between -1 and 1 (feel free to try that), x_i^2 is going to be much larger than |x_i|, so the gradient w.r.t. a1 will typically have a much larger magnitude than the gradient w.r.t. b1, and that parameter will update faster.
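
The comparison above can be checked numerically. This is only a sketch in plain Python: the function name `gradients`, the sample ranges, the seed and the parameter values are all my own assumptions, not anything from the notebook or the TensorFlow.js demo.

```python
import random

def gradients(a1, b1, a2, b2, xs):
    """Gradients of L = sum_i ((a1 - a2) * x_i + (b1 - b2))**2 w.r.t. a1 and b1."""
    grad_a = sum(2 * ((a1 - a2) * x + (b1 - b2)) * x for x in xs)
    grad_b = sum(2 * ((a1 - a2) * x + (b1 - b2)) for x in xs)
    return grad_a, grad_b

random.seed(0)  # arbitrary seed, only for repeatability
wide = [random.uniform(-10, 10) for _ in range(1000)]   # most |x_i| > 1
narrow = [random.uniform(-1, 1) for _ in range(1000)]   # all |x_i| < 1

# Same parameter error in both cases: slope and intercept are each off by 1.
ga_wide, gb_wide = gradients(1.0, 1.0, 2.0, 2.0, wide)
ga_narrow, gb_narrow = gradients(1.0, 1.0, 2.0, 2.0, narrow)

print("x in [-10, 10]:", abs(ga_wide), abs(gb_wide))
print("x in [-1, 1]:  ", abs(ga_narrow), abs(gb_narrow))
```

With the wide inputs the gradient for a1 should come out far larger in magnitude than the one for b1, and with inputs inside [-1, 1] the ordering should flip, which matches the caveat about x_i between -1 and 1.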

An intuitive explanation I can think of is that since you’re comparing two lines, your loss is strongly influenced by the angle between the lines.
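
One way to read the angle intuition, as a rough sketch: tilt the fitted line by a small amount and, separately, shift it by the same amount, then compare the loss each error produces. The helper `loss`, the inputs and the size of `delta` are illustrative assumptions.

```python
def loss(a, b, a_true, b_true, xs):
    """Mean squared difference between the line a*x + b and the true line."""
    return sum(((a - a_true) * x + (b - b_true)) ** 2 for x in xs) / len(xs)

xs = [float(x) for x in range(-10, 11)]  # evenly spaced inputs in [-10, 10]
delta = 0.1                              # same-sized error in slope or intercept

slope_loss = loss(2.0 + delta, 1.0, 2.0, 1.0, xs)      # line tilted, not shifted
intercept_loss = loss(2.0, 1.0 + delta, 2.0, 1.0, xs)  # line shifted, not tilted

mean_x_squared = sum(x * x for x in xs) / len(xs)
print(slope_loss / intercept_loss, mean_x_squared)
```

The ratio of the two losses equals the mean of x_i^2 (up to floating point rounding), so with inputs spread well beyond [-1, 1] an error in the slope costs much more than an equal error in the intercept.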

I also imagine the gradients would be different. Taking the simple example of
y = m•x + b
\frac{\partial y}{\partial m} = x
\frac{\partial y}{\partial b} = 1

Of course, the model actually differentiates a loss function, using the matrix calculus version of these derivatives, but the core principle is the same.
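
To connect those two derivatives to the loss, here is a minimal sketch that pushes them through a mean squared error loss with the chain rule and runs a few steps of plain gradient descent. The function `mse_grads`, the true line (m = 3, b = 2), the learning rate and the step count are assumptions chosen for illustration; this is not the notebook's or the demo's code.

```python
def mse_grads(m, b, xs, ys):
    """Chain rule on L = mean((m*x + b - y)**2):
    dL/dm = mean(2 * err * x) and dL/db = mean(2 * err * 1)."""
    n = len(xs)
    errs = [m * x + b - y for x, y in zip(xs, ys)]
    grad_m = sum(2 * e * x for e, x in zip(errs, xs)) / n
    grad_b = sum(2 * e for e in errs) / n
    return grad_m, grad_b

xs = [float(x) for x in range(-10, 11)]  # inputs in [-10, 10], mean zero
ys = [3.0 * x + 2.0 for x in xs]         # assumed true line: m = 3, b = 2

m, b, lr = 0.0, 0.0, 0.01
for step in range(20):
    grad_m, grad_b = mse_grads(m, b, xs, ys)
    m -= lr * grad_m
    b -= lr * grad_b

print("m error:", abs(m - 3.0), "b error:", abs(b - 2.0))
```

With these particular inputs the error in m shrinks by a factor of roughly 0.27 per step, while the error in b only shrinks by a factor of 0.98, so after 20 steps m is essentially learned and b has barely moved. That is the same behaviour described in the original question.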