I’m looking at a structured dataset where the test set is missing two columns that the training set has: weight and y. The loss function is weighted MSE (I think this is really wSE, but that’s what they’ve called it).

So I have to predict the y values (y hat), but I also have unknown weights in the test set and they affect the loss too.
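To make the loss concrete, here is a minimal sketch of what I take weighted MSE to mean. The function name `weighted_mse` and the `normalize` switch are my own, and whether the task divides by the sum of the weights or just sums the weighted squared errors is an assumption, so both variants are shown.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights, normalize=True):
    """Weighted squared error between targets and predictions.

    normalize=True divides by the sum of the weights (a weighted mean);
    normalize=False returns the plain weighted sum. Which of the two the
    task actually uses is an assumption here.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = np.sum(weights * (y_true - y_pred) ** 2)
    return float(total / np.sum(weights)) if normalize else float(total)

# Toy numbers, not taken from the dataset.
y = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.0, 5.0]
w = [1.0, 1.0, 2.0]

loss_mean = weighted_mse(y, y_hat, w)                  # weighted mean of squared errors
loss_sum = weighted_mse(y, y_hat, w, normalize=False)  # weighted sum of squared errors
```

Either way, each row's squared error is scaled by its weight, so a miss on a heavily weighted row costs more than the same miss on a lightly weighted one.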

The weights are significant. In the training set they have the following stats: max 141.4, min 0.26, mean 14.2, std dev 20.7.

My first thought is to do something like the lesson 3/4 structured data example, but doing the whole thing twice. The first time I’d drop y from the training set, treat weight as the dependent variable, and try to learn to predict the weights. Then I’d use that model to predict the weights of the test set. Finally I’d use my training data with y included, and the test set with predicted weights, to try to predict y while minimising the wMSE score.

(I’m not 100% sure I’m following the problem setup, so let me know if I’m way off!)

I would think that you wouldn’t need to predict the test set weights. The only lever you have to lower the loss is shrinking (y_hat_i - y_i)^2. (The weights contribute, but they’re out of your control.) The only way the weights could help you is if they helped you predict y. But predicting the weights can’t give you any extra help there, since after all you predicted them with the same information you’re going to use to predict y!

It’s true that if you don’t know or don’t predict the test set weights then you obviously can’t tell what your actual loss is. But if you do the best you can at predicting y, you’ll have done the best you can on the loss, regardless of what the weights turn out to be.
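As a quick sanity check of that argument, here is a small sketch with synthetic numbers (nothing below comes from the actual dataset). It assumes the loss is normalised by the sum of the weights, and it draws made-up positive weights from the min/max range quoted above. If one set of predictions is closer to y on every single row, it scores lower under any positive weights.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Weighted mean of squared errors (normalisation is an assumption)."""
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))

rng = np.random.default_rng(0)
y = rng.normal(size=50)              # synthetic targets
noise = rng.normal(size=50)
y_hat_close = y + 0.1 * noise        # small error on every row
y_hat_far = y + 0.5 * noise          # five times that error on every row

results = []
for _ in range(5):
    # Made-up weights; only the range is borrowed from the training stats.
    w = rng.uniform(0.26, 141.4, size=50)
    results.append((weighted_mse(y, y_hat_close, w),
                    weighted_mse(y, y_hat_far, w)))

# The row-by-row closer predictions win under every weight draw.
closer_always_wins = all(close < far for close, far in results)
```

The check only covers the case where one prediction set is at least as close on every row. It says nothing about the case where being closer on some rows means being further off on others.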

This does make sense actually… it’d certainly make the situation a lot simpler. I see what you mean about how I’d be predicting y_hat with the same data I’d be using to predict the weights, and I think you’re probably right. I guess my thinking was that by having the weights (if I managed to learn them accurately) I’d have a better loss function to minimise: it’d be better to trade off error on a lower-weight row against error on a higher-weight row, and that’d be reflected in the network. But now I’m thinking that you might be right, and the simpler approach is probably better. It seems to tend to be in deep learning! Thanks for your comment.
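For what it’s worth, the trade-off I had in mind can be seen with the training weights alone, which are known, so it doesn’t need predicted test weights. Here is a toy sketch with made-up numbers, assuming the loss is normalised by the sum of the weights. When the model can’t fit every row (here it is restricted to a single constant prediction), the weighted mean of y scores better under weighted MSE than the plain mean does.

```python
import numpy as np

def weighted_mse(y_true, y_pred, weights):
    """Weighted mean of squared errors (normalisation is an assumption)."""
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))

# Made-up training rows: one heavily weighted row sits far from the others.
y = np.array([0.0, 0.0, 10.0])
w = np.array([1.0, 1.0, 8.0])

plain_mean = y.mean()                      # ignores the weights
weighted_mean = np.average(y, weights=w)   # pulled towards the heavy row

loss_plain = weighted_mse(y, np.full_like(y, plain_mean), w)
loss_weighted = weighted_mse(y, np.full_like(y, weighted_mean), w)
```

In a real model the analogous move would be to pass the training weights into the training loss as per-row sample weights, so errors on heavy rows are penalised more during fitting.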