Pseudo Labeling in Keras

mattobrien415 · December 28, 2016, 8:44am

I was watching Lesson 6 again, and was going over the parts on pseudolabeling. The example where we are using MNIST is pretty great, and is easy to follow, since the labels you predict can be easily concatenated into an array.

It seems a bit more difficult when your only notion of labels is indicated by the directory your image lives in.

Does anyone have any ideas on what kind of strategy would be effective to apply your predicted labels to the data for pseudo labeling, when this is your structure?

I imagine you could actually move your test images (after prediction) into the appropriate training directory if you had to. But this feels clumsy…

jeremy · December 30, 2016, 3:52am

I don’t see the problem here - sorry if I’m being slow. The labels are made available by the generators directly as an array. get_classes() in utils.py shows how to do this. But for pseudo labeling you want to use the predicted labels for the test set, not actual labels, so I’m not sure how this is relevant.
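For anyone who wants to see the idea outside Keras: the sketch below is not the course's get_classes(). It is a hypothetical NumPy-only stand-in (the names labels_from_directory and onehot are illustrative) that turns a root/<class>/<file> layout into an integer label array, which is roughly what a directory-based generator hands you.

```python
import os
import tempfile

import numpy as np


def labels_from_directory(root):
    """Return (paths, integer labels, class names) for a root/<class>/<file> layout."""
    class_names = sorted(
        d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))
    )
    paths, labels = [], []
    for idx, name in enumerate(class_names):
        class_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(class_dir)):
            paths.append(os.path.join(class_dir, fname))
            labels.append(idx)  # the label is just the index of the directory
    return paths, np.array(labels, dtype=np.int64), class_names


def onehot(labels, n_classes):
    """One-hot encode an integer label array."""
    return np.eye(n_classes, dtype=np.float32)[labels]


# Tiny throwaway directory tree, only to exercise the helper.
with tempfile.TemporaryDirectory() as root:
    for class_name, n_files in [("cats", 2), ("dogs", 3)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(n_files):
            open(os.path.join(root, class_name, f"{i}.jpg"), "w").close()
    paths, labels, class_names = labels_from_directory(root)

print(class_names)  # ['cats', 'dogs']
print(labels)       # [0 0 1 1 1]
```

Once the labels are an array like this, the directory structure no longer matters for pseudo labeling.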

I do show pseudo-labeling for the fisheries competition in lesson7, so perhaps just check that out?

mattobrien415 · January 3, 2017, 5:40am

Ah yes, this was a really dumb question I asked! The answer is, ‘just concatenate it.’
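A minimal sketch of what "just concatenate it" might look like, assuming the images and labels are already NumPy arrays. The array shapes are made up, and test_probs is a random stand-in for whatever the trained model's predict call would return on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10

# Stand-ins for real data (shapes are illustrative assumptions).
x_train = rng.random((100, 28, 28, 1), dtype=np.float32)
y_train = np.eye(n_classes, dtype=np.float32)[rng.integers(0, n_classes, 100)]
x_test = rng.random((40, 28, 28, 1), dtype=np.float32)

# Stand-in for the model's predicted class probabilities on the test set.
test_probs = rng.dirichlet(np.ones(n_classes), size=40).astype(np.float32)

# Hard pseudo labels: one-hot of the most likely class for each test image.
pseudo_labels = np.eye(n_classes, dtype=np.float32)[test_probs.argmax(axis=1)]

# The combined set is simply the two arrays stacked along the first axis.
x_comb = np.concatenate([x_train, x_test])
y_comb = np.concatenate([y_train, pseudo_labels])

print(x_comb.shape, y_comb.shape)  # (140, 28, 28, 1) (140, 10)
```

The predicted probabilities could also be concatenated directly as soft labels instead of taking the argmax; which works better is something to try rather than assume.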

Moving forward with the pseudolabeling, as I am sitting here implementing it, a nervous feeling is starting to grow on me.

We are basically:

1. Taking an end-to-end model we like / trust.
2. Using that model to predict classes for the test set, which is the set we will eventually submit to Kaggle.
3. Using this now-larger set of data to create a presumably more effective model. We then run the test set through this final model, which gives us the predictions to submit to Kaggle.
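The steps above can be sketched end to end on toy data. This is only an illustration under loud assumptions: a nearest-centroid classifier on synthetic 2-D points stands in for the real network, and y_test_hidden plays the role of the labels Kaggle keeps to itself. It is used only in the last line to check the result, never for fitting.

```python
import numpy as np


def fit_centroids(x, y, n_classes):
    """'Train' a nearest-centroid model: one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])


def predict(centroids, x):
    """Assign each point to the class of its nearest centroid."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0]])

# Small labeled training set and a larger test set whose labels are hidden.
y_train = np.repeat([0, 1], 10)
x_train = centers[y_train] + rng.normal(size=(20, 2))
y_test_hidden = rng.integers(0, 2, size=200)
x_test = centers[y_test_hidden] + rng.normal(size=(200, 2))

# Step 1: a model fitted on the labeled training data only.
model = fit_centroids(x_train, y_train, n_classes=2)

# Step 2: predict classes for the test set; these are the pseudo labels.
pseudo = predict(model, x_test)

# Step 3: refit on train + pseudo-labeled test, then predict the test set.
x_comb = np.concatenate([x_train, x_test])
y_comb = np.concatenate([y_train, pseudo])
final_model = fit_centroids(x_comb, y_comb, n_classes=2)
submission = predict(final_model, x_test)

# The hidden labels appear only here, as a check on the sketch.
accuracy = (submission == y_test_hidden).mean()
```

Note that the only label information reaching final_model comes from y_train; everything about the test set is the model's own guess.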

But, wait, we are evaluating the test set using a final model…which was built (in part) on that same test set itself!

I was under the impression that is one of those fundamental machine learning no-nos – don’t expose data to the model that was used to actually build the model. Won’t we get overconfident probabilities, since the test set has already been seen, and those images have features that would presumably be easily recognized?

rachel · January 4, 2017, 1:41am

@mattobrien415 We’re not using any of the true labels for the test set (with Kaggle, we don’t even have access to those!) so we’re not exposing the model to the labeled test set.

mattobrien415 · January 5, 2017, 9:05am

we’re not exposing the model to the labeled test set.

Makes sense! But the situation that we are in, where we are using the model that we’ve created using (in part) the test set to further evaluate that same test set, feels intuitively shaky (or maybe it’s just me?). The same images are being seen by the model twice: once to build the model, and once to get a probability for those same images. At the very least, it’s interesting that there is not a potential problem here.

jeremy · January 5, 2017, 6:02pm

It’s very odd, for sure - but not shaky. In fact, using the unlabeled test data is the entire purpose of semi-supervised learning.

telarson · May 10, 2017, 1:11pm

@mattobrien415 I agree this feels shaky. It feels like your model would have a tendency to overfit the test data (which was converted to training data with the pseudo labels) and that if you ran your trained model on an unseen set of training data it would perform worse. I’ll have to play around and read more until I convince myself otherwise.

I get that we’re not using known labels for the test set. It just feels like features learned during training on pseudo-labeled test data would then get activated by the same data at test time and cause overfitting.

jeremy · May 10, 2017, 5:34pm

You can’t overfit the test data if you don’t use the test labels!

telarson · May 11, 2017, 10:02am

I believe you but I’m going to need to beat that into my head by doing more semi-supervised learning.

Thanks!

Tait

brendan · May 11, 2017, 11:26pm

Here’s a question:

I’m working on the Amazon Rainforest competition (great for beginners btw!) and the Kaggle gods for some reason told us exactly which files in the test set will be used for the private leaderboard scores.

They’ve given us 40K images for the public leaderboard and 20K images for the private leaderboard.

1. Is there any advantage to pseudo-labeling one group vs another? Given the above point, I assume the answer is no.
2. Regarding pseudo-labeling ratios: my train split is 36K images, so how many pseudo-labeled test images should I add? According to the lecture, the final ratio should be around 1/4 to 1/3 of the combined training set.
3. Are there any other techniques we can use to avoid overfitting the private set? One idea is to extract features from the training, public, and private sets, like mean pixel intensity, and compare. Another is to simply browse a random sample of each and see if there are any obvious differences, like a different location, etc.
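On the ratio question, the arithmetic can be worked out directly. This sketch assumes the 1/4 to 1/3 figure means the share of pseudo-labeled images in the combined set, so with n_train real images and a target share r, the count is n_train * r / (1 - r). The helper name is hypothetical.

```python
def pseudo_count_for_fraction(n_train, fraction):
    """Pseudo-labeled images needed so they form `fraction` of the combined set."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    # Solve n_pseudo / (n_train + n_pseudo) = fraction for n_pseudo.
    return round(n_train * fraction / (1 - fraction))


n_train = 36_000
low = pseudo_count_for_fraction(n_train, 1 / 4)
high = pseudo_count_for_fraction(n_train, 1 / 3)
print(low, high)  # 12000 18000
```

Under that reading, a 36K train split would take roughly 12K to 18K pseudo-labeled images, which is well under the 60K test images available, so a random subsample would be enough.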

jeremy · May 12, 2017, 9:57pm

I can’t see a reason to pseudo-label just one group. I’m not sure what ratio to use - try a few and see what works on the public leaderboard.

Definitely look carefully to see if the private dataset differs in any way at all. That’s a critical issue if you find it!
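One cheap way to start looking is the mean pixel intensity comparison suggested above. The sketch below assumes each set is loaded as a NumPy array of images; the three arrays here are synthetic stand-ins, with the "private" one deliberately made brighter so the difference shows up.

```python
import numpy as np


def intensity_stats(images):
    """Mean and standard deviation of the per-image mean pixel intensity."""
    per_image_mean = images.reshape(len(images), -1).mean(axis=1)
    return float(per_image_mean.mean()), float(per_image_mean.std())


rng = np.random.default_rng(0)

# Synthetic stand-ins for the three image sets (values in [0, 1]).
train = rng.random((50, 8, 8, 3))
public = rng.random((50, 8, 8, 3))
private = 0.5 + 0.5 * rng.random((50, 8, 8, 3))  # deliberately brighter

stats = {
    name: intensity_stats(images)
    for name, images in [("train", train), ("public", public), ("private", private)]
}

for name, (mean, std) in stats.items():
    print(f"{name:8s} mean={mean:.3f} std={std:.3f}")
```

A gap between sets on a statistic this simple would be a strong hint that the distributions differ; the absence of a gap proves little, so browsing random samples is still worthwhile.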

Master · November 20, 2017, 4:38pm

There is a problem here. I have read all the previous answers, yet to me it looks like using pseudo labels in this way is like randomly selecting validation items and inserting them into our training set.
The problem is that when the network correctly predicts a class, it assigns the true label. When a true label is assigned to an image, it is practically as if we were randomly selecting that image from the validation set and simply adding it to the training set.
It is true that some images will have wrong labels. However, if we set aside the wrong labels, we see that at the very least 50% are actually validation data with correct labels (the actual percentage should be much higher, I guess), and this doesn’t make sense, especially if we are going to test the resulting model on the validation set (with true labels) later.
Would you comment on this?
Thanks a lot

Yamano · November 22, 2017, 7:39am

Well, you are inserting true labels for data it already got right anyway. On the other hand, you’re making it harder for it to get the wrong labels right, since you’re pushing it to be wrong. I think in general it’s better when the valid/test sets are from a different distribution than the training data, like state-farm, since in this case there is something important to learn outside the training set.

The end-goal is to use pseudo-labeling on the test-set, not the validation set. You use the validation set only to get an indication that your pseudo-labeling parameters are good.
I think in the end it depends on your goal. If your goal is to do well on the test set, as in competitions, then it can help. But even in Kaggle competitions that isn’t really the end goal. The end goal there is for the competition creator to get a model that works well in the real world, where there are no pseudo-labels. They can even take a winning model and retrain it on the whole set of data (including test data with real labels) to get a production model. In that case, by giving them a model which got a good competition result by using pseudo-labels, aren’t we sort of cheating them? I’d prefer a production model that generalized well without pseudo-labels, since that seems to have more potential on never-before-seen data.

WaterRocket8236 · November 23, 2017, 6:38am

Yes, correct. Pseudo labelling is a good idea, but I would have to think carefully before using it in production. The reason is that wrong predictions on the test set may end up being used as training data.