How to deal with missing labels when you have multiple losses?

My dataset has input_features, continuous_label1 and continuous_label2. I always have data for input_features, but occasionally continuous_label1 or continuous_label2 has no value. I don’t want to remove the rows where this is the case, because learning can still happen when one of the two values is missing.

My loss function is “MSE loss for continuous_label1” + “MSE loss for continuous_label2”. What is the proper way to handle this situation? One option would be to set the missing value as the prediction. Would that make sense?

Is there a reason you’re training the two simultaneously rather than treating them as separate models? If the goal is to predict label1 and label2 with the highest accuracy I would expect you’d get (admittedly slightly) better performance by treating the two separately. Are there other constraints to your problem that require you to calculate them at the same time?

In terms of your proposed solution, I’m not sure it will work the way you expect. If you set the missing value to the prediction, the error for that label is exactly zero, so the total loss is lower than when both values are present and the gradient updates will be smaller. As long as that’s acceptable, you should be fine.
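A quick way to check this is to evaluate one squared-error term by hand. The sketch below assumes a plain squared error per label; `squared_error_and_grad` is a hypothetical helper written for this illustration, not part of any library.

```python
def squared_error_and_grad(pred, target):
    """Return (pred - target)**2 and its derivative with respect to pred."""
    diff = pred - target
    return diff ** 2, 2.0 * diff


# Label present: the term produces a non-zero loss and a non-zero gradient.
loss_present, grad_present = squared_error_and_grad(pred=0.8, target=0.5)

# Label missing and filled in with the prediction itself:
# the loss and the gradient are both exactly zero.
loss_filled, grad_filled = squared_error_and_grad(pred=0.8, target=0.8)
```

In other words, filling in the prediction behaves like dropping that label's term for the example, so only the observed label drives the update.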

How about if you define your loss function to be

if continuous_label1 & continuous_label2:
    ("MSE loss for continuous_label1" + "MSE loss for continuous_label2") / 2
else if continuous_label1 & not continuous_label2:
    "MSE loss for continuous_label1"
else if not continuous_label1 & continuous_label2:
    "MSE loss for continuous_label2"
else:
    skip training example

?

“if continuous_label1” is to be read “if continuous_label1 exists”

EDIT: instead of dividing by 2, it is better to use a weighted average of the two losses
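The case analysis above, with the weighted average from the edit, might be sketched for a single training example as follows. This is only an illustration under assumptions: the function name `combined_loss`, the use of `None` for a missing label, and the weights `w1` and `w2` (assumed positive) are all made up here, and the thread does not say how the weights should be chosen.

```python
def combined_loss(pred1, pred2, label1, label2, w1=1.0, w2=1.0):
    """Weighted average of the squared errors of the labels that exist.

    Returns None when both labels are missing, meaning "skip this example".
    """
    terms = []
    if label1 is not None:
        terms.append((w1, (pred1 - label1) ** 2))
    if label2 is not None:
        terms.append((w2, (pred2 - label2) ** 2))
    if not terms:
        return None  # nothing to learn from on this example

    # Normalize by the weights of the labels that are actually present,
    # so a single observed label is not scaled down.
    total_weight = sum(w for w, _ in terms)
    return sum(w * err for w, err in terms) / total_weight
```

With equal weights this reduces to the pseudocode: the two errors are averaged when both labels exist, and the single error is used as-is when only one exists.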

I would rather not have 2 separate models. The goal is not really to have high prediction accuracy but rather to use a layer of the network as a high-level representation of the input. Having this embedding layer simultaneously optimized for the 2 losses is a plus.

Yep, that would work. I have to figure out how to implement that efficiently on a minibatch in TensorFlow.
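One common way to do this on a whole minibatch is to avoid branching per example and use a mask instead. Below is a minimal NumPy sketch, assuming missing labels are encoded as NaN and that predictions and labels are arrays of shape (batch, 2); the name `masked_mse` is hypothetical. It computes one MSE per label over the examples where that label exists and sums the two, which matches the loss in the original question but normalizes slightly differently from the per-example averaging in the pseudocode.

```python
import numpy as np


def masked_mse(preds, labels):
    """Sum of per-label MSEs, each computed over observed labels only.

    preds, labels: float arrays of shape (batch, 2); missing labels are NaN.
    """
    mask = ~np.isnan(labels)                     # True where a label exists
    safe_labels = np.where(mask, labels, 0.0)    # keep NaN out of the arithmetic
    sq_err = np.where(mask, (preds - safe_labels) ** 2, 0.0)

    n_observed = mask.sum(axis=0)                # observed count per label
    # A label with no observations in this batch contributes 0 to the loss.
    per_label_mse = sq_err.sum(axis=0) / np.maximum(n_observed, 1)
    return per_label_mse.sum()


preds = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels = np.array([[0.0, np.nan], [3.0, 2.0], [np.nan, np.nan]])
loss = masked_mse(preds, labels)
```

The same steps should map onto TensorFlow ops such as `tf.math.is_nan`, `tf.where`, `tf.reduce_sum` and `tf.maximum`. Replacing the NaN labels before the subtraction, rather than only masking the squared error afterwards, is meant to keep NaN values out of the gradient as well as out of the loss.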