Advice for pseudo-labelling with generators?

Hi, I’m fitting a fine-tuned model and attempting to do pseudo-labeling.

In the pseudo-labeling example from lesson 4, np.concatenate is used, but I’m getting a memory error because my data is raw images, not features extracted from the convolutional layers.

A couple of questions:

  1. Can I somehow combine (“concatenate”) two Keras generators in sequence?
  2. Can I specify labels for a generator in the code? This is obviously needed when doing pseudo-labeling.

Another option that came to mind: can I alternate between a model.fit_generator() call on the true training set and a model.fit() call on the pseudo-labelled set?

model.fit_generator(train_batches, samples_per_epoch=train_batches.nb_sample, nb_epoch=1)
model.fit(pseudo_data, pseudo_labels, batch_size=64, nb_epoch=1)

This way I can’t run multiple epochs automatically, but I could put those calls in a loop. Do I lose anything by fitting this way? Will e.g. Adam behave badly in this scenario?

I’m trying to figure exactly this out right now myself.

It seems to me that Keras handles generators and raw arrays entirely differently, and I’m not sure how to override the generator output, particularly from gen.flow_from_directory(), so that we can replace the labels with our predicted ones.

The better approach in a loop would be to use train_on_batch(), and handle iterating through the batches yourself.
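A minimal sketch of that loop, reusing the train_batches iterator and the pseudo_data/pseudo_labels arrays from the question above (Keras 1-style generator attributes assumed):

# Hand-rolled training loop alternating real and pseudo-labelled batches.
# Assumes train_batches is a flow_from_directory() iterator and
# pseudo_data/pseudo_labels are in-memory arrays with one-hot labels.
nb_epochs = 5
batches_per_epoch = train_batches.nb_sample // train_batches.batch_size

for epoch in range(nb_epochs):
    for i in range(batches_per_epoch):
        # One batch of real labelled training data
        x_train, y_train = train_batches.next()
        model.train_on_batch(x_train, y_train)

        # One batch of pseudo-labelled data, cycling through the arrays
        start = (i * 64) % len(pseudo_data)
        model.train_on_batch(pseudo_data[start:start + 64],
                             pseudo_labels[start:start + 64])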


Would one possible solution be to set up a directory specifically for pseudo-labeling? For example, you could recursively copy the files from the training directory to seed your pseudo directory. Then use predict_generator on the data from your test directory, and copy the test images into the corresponding pseudo-label sub-directories based on the predictions. Then you can effectively run fit_generator with data augmentation for pseudo-labeling.
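A hedged sketch of that directory-based approach; the paths are hypothetical, and test_batches must be created with shuffle=False so the filenames line up with the predictions:

import os, shutil
import numpy as np

# Seed the pseudo directory with the real training data
shutil.copytree('data/train', 'data/pseudo')

# Predict on the test set, then file each image under its predicted class
preds = model.predict_generator(test_batches, test_batches.nb_sample)
classes = sorted(train_batches.class_indices, key=train_batches.class_indices.get)
for fname, pred in zip(test_batches.filenames, preds):
    label = classes[np.argmax(pred)]
    shutil.copy(os.path.join('data/test', fname),
                os.path.join('data/pseudo', label))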

I had a similar issue. Training separate epochs on training data and then on pseudo-labeled data gave bad results, I guess because each one pulls the weights in a different direction.

Currently I’m using a simple ‘MultiGenerator’ class I created which generates batches from multiple generators:

import numpy as np
from keras.preprocessing.image import Iterator

class MultiGenerator(Iterator):
    """Yields combined batches drawn from two underlying generators."""
    def __init__(self, gen1, gen2):
        self.gen1 = gen1
        self.gen2 = gen2

    def next(self):
        # Pull one batch from each generator and concatenate along the batch axis
        x1, y1 = self.gen1.next()
        x2, y2 = self.gen2.next()
        return np.concatenate((x1, x2)), np.concatenate((y1, y2))

You pass it to fit_generator() as usual, with all the standard perks like multiple workers.
For pseudo-labeling, for example, you can make the first generator a standard flow_from_directory() on the training data with a batch size of, say, 90, and the second generator a flow() of the valid/test data with pseudo-labels (if you can fit them in RAM) with a batch size of 10.
This results in a generator with batch size 100, split 90/10 between train and pseudo-labeled data. It lets you define the ratio of train to pseudo data during training regardless of the size of the pseudo-labeled set. You can also use different augmentation settings for the pseudo-labeled data (or none at all; I’m personally not sure yet what’s preferred).
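For concreteness, a minimal sketch of that wiring; the paths, image size, and the pseudo_data/pseudo_labels arrays are assumptions:

from keras.preprocessing.image import ImageDataGenerator

# 90 real training images per batch, with augmentation
train_gen = ImageDataGenerator(horizontal_flip=True).flow_from_directory(
    'data/train', target_size=(224, 224), batch_size=90)
# 10 pseudo-labeled images per batch, no augmentation here
pseudo_gen = ImageDataGenerator().flow(pseudo_data, pseudo_labels, batch_size=10)

# Each combined batch is 100 images: 90 train + 10 pseudo-labeled
model.fit_generator(MultiGenerator(train_gen, pseudo_gen),
                    samples_per_epoch=train_gen.nb_sample, nb_epoch=3)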

The downside is that it’s not a true shuffle of train and pseudo data, since each batch contains exactly 90%/10% train/pseudo rather than on average, but this doesn’t seem like a big issue.
