Suppose you look at the selection of a particular U and I as a matrix multiplication by the user and item inputs, rather than as given or as table lookups. This equivalence is shown in that lesson. Then the computation of the rating takes the inputs and applies only multiplication and additions by constants. That is a linear model - there are no stacked linear layers with non-linearities between them.
I recall there’s a sigmoid at the end in order to map to the ratings range, and that would technically make the model non-linear. But it’s linear up to the output stage.
Furthermore, it’s non convex which means that SGD doesn’t necessarily converge to a local minimum either.
I think GD converges to a local minimum whenever the loss is continuous and differentiable. But in any case, most losses are not convex and we don’t expect to find a global minimum. We only require that GD finds a good enough local minimum, hopefully one that generalizes well too. Thus all the various optimizer techniques, learning rate/momentum schedules. etc. to get past bad local minima and saddle points. IMO, most of these techniques lack full theoretical support, but have proven to be effective in practice. Stochastic GD adds noise to the loss and gradient, making proofs harder and certain optimizers work better!
HTH, and feel free to correct my math and opinions.
This is the point I’m stuck on.
Multiplying and adding alone aren’t enough - for example, take the function f(x,y) = x*y+3. It’s clearly not linear (plot it).
This is the exact case we have here.
Oh, I see the confusion. It’s that x and y here are the weights, not the inputs. Weights are constants during a forward pass, changing only during weight updates. So x*y+3 would be a constant that is multiplied by the inputs: the user and the item as one-hot encodings. In terms of variation of the inputs, which are what vary when the model is applied, the model is linear. That’s what “the model is linear” means.
Here we are looking at the space of weights, Rn, mapping to R, the loss. This function is required to be piecewise differentiable so that various gradient descent methods will work. This loss function may or may not be convex, but that does not matter. We only want to find a good enough spot in weight space, one that gets the training and validation examples mostly right. The optimizers are specifically designed to get to a good enough spot quickly, without getting trapped in local minima along the way.
I think your point is that pure gradient descent may land in a non-global local minimum when the loss function is not convex. Certainly true, but not directly relevant. Instead we use stochastic gradient descent, with various momentums and rate schedules to keep the loss decreasing. Besides, the global minimum is probably not what we want - it’s likely to overfit the training set and not generalize well to new inputs.
Maybe this is all just heuristic and it works?
To me it looks like heuristics informed by theory and tested by experiment.