# Extra input features give worse loss?

Here’s a counter-intuitive situation I found while working on a real-life problem. I do not understand it.

The highly simplified version:
X is a set of training vectors, each with length n.
T is the target, a single number for each vector X.

```
Y = Linear(n, 1)(X)
loss = MSELoss(Y, T)
```

The above model trains well using Adam and seems to converge to a minimum loss. It’s just a linear regression of vector X onto T.
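For concreteness, here is a minimal runnable sketch of that setup (the data, dimensions, and learning rate are made up for illustration; the thread doesn’t specify them):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, N = 2, 1000                                  # feature count, sample count
X = torch.randn(N, n)                           # toy training vectors
# toy target: a known linear function of X plus a little noise
T = X @ torch.tensor([[1.5], [-0.7]]) + 0.3 + 0.01 * torch.randn(N, 1)

model = nn.Linear(n, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), T)
    loss.backward()
    opt.step()
# the loss converges to roughly the noise floor of the toy data
```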

Next, I append to X another m numbers that I think will help predict T better. In other words, X’ now contains everything it did before, plus extra information.

```
Y = Linear(n+m, 1)(X')
loss = MSELoss(Y, T)
```

This model trains more slowly, as expected, because it has m more parameters. However, it lands on a significantly higher loss than the model with shorter inputs.

I’d think it should be at least as good as the first model. In fact, if Linear simply found zeros for the parameters corresponding to the extra information, it would exactly match the first situation. But training does not find even this.
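That claim is easy to verify directly (a toy sketch; the names and sizes are illustrative): zeroing the larger model’s extra-feature weights and copying the shared weights and bias makes the two models produce identical predictions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m, N = 2, 4, 8
X = torch.randn(N, n)
extra = torch.randn(N, m)
X_aug = torch.cat([X, extra], dim=1)    # X' = [X | extra features]

small = nn.Linear(n, 1)
big = nn.Linear(n + m, 1)
with torch.no_grad():
    big.weight.zero_()                  # extra-feature weights -> 0
    big.weight[:, :n] = small.weight    # copy the shared weights
    big.bias.copy_(small.bias)

same = torch.allclose(small(X), big(X_aug))  # True: outputs match
```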

What’s going on here? Can anyone explain?

P.S. The extra values in X’ are correlated with (not identical to) what was already in X. Maybe this is a piece of the puzzle.
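One way to quantify that suspicion numerically (an illustrative sketch with made-up data, not the actual dataset): correlated columns make the Gram matrix X’ᵀX’ ill-conditioned, and ill-conditioned problems are exactly the ones where first-order optimizers crawl.

```python
import torch

torch.manual_seed(0)
N = 1000
X = torch.randn(N, 2)
# extras that are mostly linear combinations of X, plus a little noise
extra = X @ torch.randn(2, 4) + 0.1 * torch.randn(N, 4)
X_aug = torch.cat([X, extra], dim=1)

cond_small = torch.linalg.cond(X.T @ X)
cond_big = torch.linalg.cond(X_aug.T @ X_aug)
# the augmented Gram matrix is far worse conditioned than the original
print(cond_small.item(), cond_big.item())
```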

With MSELoss you cannot get 0 for your parameter values; you have to use L1 loss. Also, the linear model is underfitting when you add more features, as it does not have enough capacity to learn your data.

Thanks for taking the time to reply.

I see your point that MSELoss can’t drive weights exactly to zero, because their gradients vanish as they approach it. They can, though, get arbitrarily close to zero even with MSELoss. In any case, I tried L1Loss. It trains slightly faster than MSELoss but converges to exactly the same weights in both cases.

Next, I tried an experiment:

• Make a toy test case with two inputs (case a) and with six inputs (case b), the six including the original two.

• Train (a) until loss is stable. Train (b) for 10x as long. Again, the loss for (a) is slightly lower than that for (b). However, the weights for the original two inputs are very different between (a) and (b).

• For case (b), insert the best weights and bias found in case (a), setting the other four weights to zero.

• As expected, the loss for case (b) starts out the same as for case (a).

• From this starting point, train (b). This time the loss for (b) does drop below the loss for (a), as one would expect. The weights (b) ultimately finds are just about the same as those it found from random initialization, not the ones it was initialized with.
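The warm-start procedure in the steps above can be sketched as follows (toy data and hyperparameters of my own invention, with the target deliberately depending on the extra features so that case (b) can beat case (a)):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m, N = 2, 4, 1000
X = torch.randn(N, n)
extra = X @ torch.randn(n, m) + 0.1 * torch.randn(N, m)  # correlated extras
X_aug = torch.cat([X, extra], dim=1)
# target uses the extras too, so the 6-input model has a lower floor
T = X @ torch.tensor([[1.5], [-0.7]]) + extra.sum(dim=1, keepdim=True) \
    + 0.05 * torch.randn(N, 1)

def train(model, inputs, steps, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(inputs), T)
        loss.backward()
        opt.step()
    return loss.item()

# (a): train the small model until its loss is stable
small = nn.Linear(n, 1)
loss_a = train(small, X, 3000)

# (b): initialize the large model from (a), zeroing the extra weights
big = nn.Linear(n + m, 1)
with torch.no_grad():
    big.weight.zero_()
    big.weight[:, :n] = small.weight
    big.bias.copy_(small.bias)

# the warm-started model begins at (a)'s loss instead of a random one,
# then fits the extra features and ends below it
loss_b = train(big, X_aug, 3000)
```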

Conclusions:

• In my original runs I simply did not train (b) long enough for its loss to drop below (a)'s, even though I trained it much longer than (a). My original conclusion was wrong.

• The model with 6 features trained much, much, much slower than the one with 2 features.

• It was faster overall to train with two features first, and then use those weights to initialize the features in common with the larger model. Maybe this concept will prove useful someday, provided you can find a way to transfer weights from a simpler model into a more complex one. (That was possible for this particular example.)
