Advice for pseudo-labelling with generators?

Hi, I’m fitting a fine-tuned model and attempting to do pseudo-labeling.

In the pseudo-labeling example from lesson 4, np.concatenate is used, but I’m getting a memory error because my data is raw images, not features extracted from the convolutional layers.

A couple of questions:

  1. Can I somehow combine (“concatenate”) two Keras generators in sequence?
  2. Can I specify labels for a generator in the code? This is obviously needed when doing pseudo-labeling.

Another option that came to mind: can I alternate between a model.fit_generator() call on the true training set and a model.fit() call on the pseudo-labelled set?

model.fit_generator(train_batches, samples_per_epoch=train_batches.nb_sample, nb_epoch=1)
model.fit(pseudo_data, pseudo_labels, batch_size=64, nb_epoch=1)

This way I can’t run multiple epochs automatically, but I could put those calls in a loop. Do I lose anything by fitting this way? Will e.g. Adam behave badly in this scenario?

I’m trying to figure exactly this out right now myself.

It seems to me that Keras handles generators and raw arrays entirely differently, and I’m not sure how to override the generator output, particularly from gen.flow_from_directory(), so that we can replace the labels with our predicted ones.

The better approach in a loop would be to use train_on_batch(), and handle iterating through the batches yourself.
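A minimal sketch of that loop, reusing the train_batches iterator and the pseudo_data/pseudo_labels arrays from the question above (Keras 1-style generator attributes assumed):

# Hand-rolled training loop alternating real and pseudo-labelled batches.
# Assumes train_batches is a flow_from_directory() iterator and
# pseudo_data/pseudo_labels are in-memory arrays with one-hot labels.
nb_epochs = 5
batches_per_epoch = train_batches.nb_sample // train_batches.batch_size

for epoch in range(nb_epochs):
    for i in range(batches_per_epoch):
        # One batch of real labelled training data
        x_train, y_train = train_batches.next()
        model.train_on_batch(x_train, y_train)

        # One batch of pseudo-labelled data, cycling through the arrays
        start = (i * 64) % len(pseudo_data)
        model.train_on_batch(pseudo_data[start:start + 64],
                             pseudo_labels[start:start + 64])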


Would one possible solution be to set up a directory specifically for pseudo-labeling? For example, you could recursively copy the files from the training directory to seed your pseudo directory. Then use predict_generator on the data from your test directory, and copy the test images into the corresponding pseudo-label sub-directories based on the predictions. Then you can effectively run fit_generator with data augmentation for pseudo-labeling.
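A hedged sketch of that directory-based approach; the paths are hypothetical, and test_batches must be created with shuffle=False so the filenames line up with the predictions:

import os, shutil
import numpy as np

# Seed the pseudo directory with the real training data
shutil.copytree('data/train', 'data/pseudo')

# Predict on the test set, then file each image under its predicted class
preds = model.predict_generator(test_batches, test_batches.nb_sample)
classes = sorted(train_batches.class_indices, key=train_batches.class_indices.get)
for fname, pred in zip(test_batches.filenames, preds):
    label = classes[np.argmax(pred)]
    shutil.copy(os.path.join('data/test', fname),
                os.path.join('data/pseudo', label))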

I had a similar issue. Training separate epochs on training data and then on pseudo-labeled data gave bad results, I guess because each one pulls the weights in a different direction.

Currently I’m using a simple ‘MultiGenerator’ class I created which generates batches from multiple generators:

import numpy as np
from keras.preprocessing.image import Iterator

class MultiGenerator(Iterator):
    """Yields combined batches drawn from two underlying generators."""
    def __init__(self, gen1, gen2):
        self.gen1 = gen1
        self.gen2 = gen2

    def next(self):
        # Pull one batch from each generator and concatenate along the batch axis
        x1, y1 = self.gen1.next()
        x2, y2 = self.gen2.next()
        return np.concatenate((x1, x2)), np.concatenate((y1, y2))

You pass it to fit_generator() as usual, with all the standard perks like multiple workers.
For pseudo-labeling, for example, you can make the first generator a standard flow_from_directory() on the training data with a batch size of, say, 90, and the second generator a flow() of the valid/test data with pseudo-labels (if you can fit them in RAM) with a batch size of 10.
This results in a generator with batch size 100, split 90/10 between train and pseudo-labeled data. It lets you define the ratio of train to pseudo data during training regardless of the size of the pseudo-labeled set. You can also use different augmentation settings for the pseudo-labeled data (or none at all; I’m personally not sure yet what’s preferred).
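For concreteness, a minimal sketch of that wiring; the paths, image size, and the pseudo_data/pseudo_labels arrays are assumptions:

from keras.preprocessing.image import ImageDataGenerator

# 90 real training images per batch, with augmentation
train_gen = ImageDataGenerator(horizontal_flip=True).flow_from_directory(
    'data/train', target_size=(224, 224), batch_size=90)
# 10 pseudo-labeled images per batch, no augmentation here
pseudo_gen = ImageDataGenerator().flow(pseudo_data, pseudo_labels, batch_size=10)

# Each combined batch is 100 images: 90 train + 10 pseudo-labeled
model.fit_generator(MultiGenerator(train_gen, pseudo_gen),
                    samples_per_epoch=train_gen.nb_sample, nb_epoch=3)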

The downside is that it’s not a true shuffle of train and pseudo data, since each batch contains exactly 90%/10% train/pseudo rather than on average, but this doesn’t seem like a big issue.
