Why pseudo-label when you can arbitrarily-label?! ;)

Just some random musing I wanted to share, and see what folks think.

I was thinking about fine-tuning and pseudo-labeling (each independently, and in the context of CNN image recognition). Both techniques seem to imply that some parts of a neural network are being used to understand / categorize / grok the input, while other parts of the network are turning that understanding / categorization / grokedness(?) into the desired output format. More specifically:

  • fine-tuning – shows that you can separate parts of the model used to find significant image features, from the part of the model used to classify those features into an output.
  • pseudo-labeling – shows that even if the labels are sometimes wrong, the network is still able to learn something useful from them

So my question was, can I show that a CNN can learn useful weights even from arbitrary labels? The answer is pretty cool I think.

First I set up a simple model and grabbed the State Farm data.

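from keras.models import Sequential
from keras.layers import Convolution2D, Dense, Flatten
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam

def conv1(batches):
    # Keras 1.x API with Theano-style (channels, rows, cols) input.
    model = Sequential([
            BatchNormalization(axis=1, input_shape=(3,224,224)),
            Convolution2D(32,3,3, activation='relu'),
            Convolution2D(64,3,3, activation='relu'),
            Flatten(),  # assumed: not in the post as written, but required before Dense
            Dense(200, activation='relu'),
            Dense(10, activation='softmax')
        ])
    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy', metrics=['accuracy'])
    # val_batches is a validation generator defined elsewhere in the notebook.
    # The original call was cut off here; nb_val_samples is an assumed completion.
    model.fit_generator(batches, batches.nb_sample, nb_epoch=1, validation_data=val_batches,
                        nb_val_samples=val_batches.nb_sample)
    return model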

Then I took the training data, copied it into “training_arbitrary”, and shuffled the copy so that images were randomly assigned to the c0, c1, …, c9 folders.
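For illustration, the shuffle could be sketched roughly like this. This is an assumed reconstruction, not the code actually used: the function name `make_arbitrary_copy`, its arguments, and the seed are all hypothetical, and it assumes file names are unique across class folders (otherwise copies would overwrite each other).

```python
import os
import random
import shutil


def make_arbitrary_copy(src_dir, dst_dir, seed=0):
    """Copy every file under src_dir/<class>/ into a randomly chosen dst_dir/<class>/."""
    classes = sorted(
        d for d in os.listdir(src_dir) if os.path.isdir(os.path.join(src_dir, d))
    )
    rng = random.Random(seed)
    for cls in classes:
        os.makedirs(os.path.join(dst_dir, cls), exist_ok=True)

    copied = 0
    for cls in classes:
        for fname in sorted(os.listdir(os.path.join(src_dir, cls))):
            # The new label is drawn independently of the true class.
            new_cls = rng.choice(classes)
            shutil.copy(os.path.join(src_dir, cls, fname),
                        os.path.join(dst_dir, new_cls, fname))
            copied += 1
    return copied
```

Called as `make_arbitrary_copy("train", "training_arbitrary")` (paths assumed), this leaves the real training folder untouched, so the same images are available later with their true labels for fine-tuning.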

I trained two models, one on the real training data, one on the training_arbitrary data.

Model Normal

Epoch 1/1
20220/20220 [==============================] - 285s -
loss: 0.2348 - acc: 0.9412 - val_loss: 0.4316 - val_acc: 0.9446

Model Arbitrary

Epoch 1/1
20220/20220 [==============================] - 310s -
loss: 2.4156 - acc: 0.0994 - val_loss: 2.4057 - val_acc: 0.0690

As you can see, the arbitrary model scores about 10% training accuracy (and about 7% on the validation set), which is what we’d expect from random labels spread over 10 classes.
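As a quick sanity check on those numbers, here is the chance baseline for a 10-class problem, assuming the model predicts a uniform distribution over classes. The variable names are just for illustration.

```python
import math

num_classes = 10

# A model that guesses uniformly is right one time in num_classes.
chance_accuracy = 1.0 / num_classes

# Categorical cross-entropy of a uniform prediction is -ln(1/num_classes).
chance_loss = -math.log(1.0 / num_classes)

print("chance accuracy: %.4f" % chance_accuracy)
print("chance loss:     %.4f" % chance_loss)
```

The chance loss works out to ln(10) ≈ 2.3026, so the reported losses of 2.4156 and 2.4057 sit close to what a network that has learned nothing about the labels would produce.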

So here’s the hypothesis: The arbitrary_model learned to grok the images just as well as the normal_model, so after I fine-tune both of them, I’ll see similar results.

To fine-tune the models, I set every layer before Flatten() to trainable = False, then fit them on the normal training set (no longer using the arbitrary training set).
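A minimal sketch of that freezing step, assuming the model exposes a `layers` list and marks its flatten layer with a class named `Flatten` (as Keras does). The helper name `freeze_before_flatten` is hypothetical, not something from the notebook.

```python
def freeze_before_flatten(model):
    """Set trainable=False on every layer that precedes the first Flatten layer.

    Returns the list of layers that were frozen. If the model has no Flatten
    layer, every layer ends up frozen.
    """
    frozen = []
    for layer in model.layers:
        if type(layer).__name__ == "Flatten":
            break  # leave Flatten and everything after it trainable
        layer.trainable = False
        frozen.append(layer)
    return frozen
```

In Keras, a change to `trainable` only takes effect once the model is compiled again, so the usage would be `freeze_before_flatten(model)` followed by `model.compile(...)` and then the fit on the real training batches.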

Fine-Tune Model Normal

Epoch 1/1
20220/20220 [==============================] - 284s -
loss: 0.0149 - acc: 0.9984 - val_loss: 0.0283 - val_acc: 0.9936

Fine-Tune Model Arbitrary

Epoch 1/1
20220/20220 [==============================] - 282s -
loss: 0.4738 - acc: 0.8769 - val_loss: 0.1641 - val_acc: 0.9914

Surprisingly, they achieved comparable results!

Okay, so how much of this result is driven purely by the fine-tuned layers? To approximate that, I created a model of just the fine-tuned layers and ran it for two epochs, which seemed fairer since the other models got an initial epoch and then a fine-tune epoch.

Performance of Flattened Layers Only

Epoch 2/2
20220/20220 [==============================] - 282s -
loss: 0.4761 - acc: 0.8825 - val_loss: 0.1667 - val_acc: 0.9741

Not as good (val_acc 0.9741 versus 0.9914)! So training a model on arbitrarily labeled data did help, and it helped almost as much as training on correctly labeled data, provided you fine-tune both models on the normal data before measuring performance.

So, I guess here’s what I was hoping to get some feedback on:

  1. Did I make a mistake that makes this whole line of thinking wrong? That’d be a bummer.
  2. Even if arbitrary labels kinda work, are pseudo-labels ALWAYS practically better? …Probably…but…
  3. It seems that gradient descent will cause a network to effectively learn feature extraction (aka grokedness ;)) even with arbitrary labels. If that’s true… then instead of properly calculating the correct gradient (from the correct labels), why not just pick an incorrect gradient (since you’d be calculating it from incorrect labels anyway)? There must be some computational efficiencies to be gained from relieving yourself of properly calculating the gradient each batch. Moreover, if you’re just going to pick an incorrect gradient to use (because forcing a model to converge to anything seems to have value), you could throw unlabeled data at the network just fine. Right? Or have I left the reservation…
  4. Okay, compromise: still calculate the gradient every so often, but then reuse that calculated gradient over multiple batches like it’s never going out of style. You can just fine-tune the model later anyway.
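To make item 4 concrete, here is a toy sketch of "recompute the gradient only every few batches", on a one-parameter least-squares problem rather than a CNN. Everything in it (the function name `train_with_stale_gradients`, the `refresh_every` parameter, the learning rate, the data) is an assumption for illustration, and it says nothing about whether the idea helps on a real network.

```python
import random


def train_with_stale_gradients(data, refresh_every=4, lr=0.05,
                               batch_size=8, epochs=40, seed=0):
    """Fit y ~ w * x by SGD, recomputing the gradient every `refresh_every` batches."""
    data = list(data)  # work on a copy so the caller's list is not shuffled
    rng = random.Random(seed)
    w = 0.0
    grad = 0.0
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            if step % refresh_every == 0:
                # Gradient of the mean squared error w.r.t. w on this batch.
                grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            # On the other steps the last computed (stale) gradient is reused.
            w -= lr * grad
            step += 1
    return w


# Toy data: y = 3 * x with x drawn from [-1, 1].
_rng = random.Random(1)
toy_data = [(x, 3.0 * x) for x in (_rng.uniform(-1.0, 1.0) for _ in range(64))]
w_hat = train_with_stale_gradients(toy_data)
```

One thing the sketch makes visible: applying the same gradient k times in a row is the same as a single step with k times the learning rate, and the batches in between never influence the weights. So in this plain form the saving comes from effectively skipping those batches, not from learning anything from them.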

Thanks for the read,


I like your thoroughness and formatting.

Could you post a link to your notebook?

I’m confused about using arbitrary labels, because I thought true labels (or labels connected to true labels, i.e. pseudo-labels) were necessary for pulling information out of the inputs, although unsupervised learning contradicts that thought.

I’m going to bet that the final dense layer is just learning to undo the random transformations of the previously added layers. The arbitrary labels are not providing any information to the additional layers (hence 10% accuracy before fine-tuning with real labels).

The information is stored in the pretrained weights of VGG and the labels you used for fine-tuning. That final dense layer has enough capacity to (almost) compensate for the other random layers.

That said, I did come across a paper that used label noise as an additional form of regularization on top of dropout (can’t remember the reference now unfortunately). And you can find structure in unlabeled data (autoencoders, etc).