Pseudo Labelling

I didn’t understand why pseudo-labelling makes sense. Isn’t it like already seeing the mapping of inputs to outputs and then predicting for those same inputs?

That is, how does it generalize when your model has already seen those inputs?

From what I understand, it should mostly be used when you have a lot of unlabelled data but only a little labelled data. The pseudo-labelling process in a nutshell:

  1. Take an unlabelled item.
  2. Predict its label.
  3. Pretend this is now a labelled item.
  4. Use this “labelled” item to fit your model.

Do note that in this case you don’t actually know the label of the item until you predict it - if you did, you wouldn’t have needed the prediction. This isn’t the same as training the model on both the training set and the test set. You perform the usual procedure (train on the training set, validate on a validation set, and once a model is picked, check how it generalizes on the test set), but you augment your training set with some unlabelled data that has gone through the pseudo-labelling process.
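
To make the steps above concrete, here is a minimal sketch. The choice of scikit-learn, the classifier, and the `X_train` / `y_train` / `X_unlabelled` arrays are all assumptions for illustration, not something from the original post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed to exist already:
#   X_train, y_train  - the small labelled training set
#   X_unlabelled      - the large pool of unlabelled items

# 1. Fit a first model on the labelled data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 2. Predict labels for the unlabelled items (the "pseudo-labels").
pseudo_labels = model.predict(X_unlabelled)

# 3. Pretend these predictions are real labels and augment the training set.
X_augmented = np.vstack([X_train, X_unlabelled])
y_augmented = np.concatenate([y_train, pseudo_labels])

# 4. Refit the model on labelled + pseudo-labelled data.
model.fit(X_augmented, y_augmented)
```

The validation and test sets stay untouched throughout, so you can still measure generalization the usual way.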

Whether this augmentation results in better or worse performance on previously unseen items is probably something that should be tried and tested in each specific case.

Then how can one validate the results? That is, isn’t it possible that the model is learning the wrong mapping because what we predicted is wrong?

And also, if there are more classes, doesn’t the chance of predicting the wrong class increase, making our model learn the wrong mapping?

One should validate the results using the same techniques one uses without pseudo-labelling. From what I understand, the model can of course become worse due to some of the predictions being wrong - but that may be compensated for by the predictions that are right.
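
For instance, continuing the hypothetical sketch above (and assuming a held-out `X_valid` / `y_valid` split), you can simply compare validation scores with and without the pseudo-labelled data and keep whichever model does better:

```python
from sklearn.metrics import accuracy_score

# Baseline: trained on the labelled data only.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Augmented: trained on labelled + pseudo-labelled data.
augmented = LogisticRegression(max_iter=1000).fit(X_augmented, y_augmented)

print("baseline :", accuracy_score(y_valid, baseline.predict(X_valid)))
print("augmented:", accuracy_score(y_valid, augmented.predict(X_valid)))
```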

I don’t think the number of classes would have a big impact here. It’s not like we’re trying to take a model which is just slightly better than random guessing, predict a bunch of labels and hope it improves. As far as I understand, the model already has to be pretty good before you even start considering pseudo-labelling. The purpose of the latter is merely to improve something which is decent already.
