Just some random musing I wanted to share, and see what folks think.
I was thinking about fine-tuning and pseudo-labeling (each independently, in the context of CNN image recognition). Both techniques seem to imply that some parts of a neural network are used to understand / categorize / grok the input, while other parts of the network turn that understanding / categorization / grokedness(?) into the desired output format. More specifically:
- fine-tuning – shows that you can separate the part of the model used to find significant image features from the part used to classify those features into an output.
- pseudo-labeling – shows that even when the labels are sometimes wrong, the network can still learn something useful from them.
So my question was, can I show that a CNN can learn useful weights even from arbitrary labels? The answer is pretty cool I think.
First I set up a simple model and grabbed the Statefarm data.
```python
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.layers.normalization import BatchNormalization
from keras.optimizers import Adam

def conv1(batches, val_batches):
    model = Sequential([
        BatchNormalization(axis=1, input_shape=(3, 224, 224)),
        Convolution2D(32, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D((3, 3)),
        Convolution2D(64, 3, 3, activation='relu'),
        BatchNormalization(axis=1),
        MaxPooling2D((3, 3)),
        Flatten(),
        Dense(200, activation='relu'),
        BatchNormalization(),
        Dense(10, activation='softmax')
    ])
    model.compile(Adam(lr=1e-4), loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit_generator(batches, batches.nb_sample, nb_epoch=1,
                        validation_data=val_batches,
                        nb_val_samples=val_batches.nb_sample)
    return model
```
Then I took the training data, copied it into “training_arbitrary”, and shuffled the copied data so that images were randomly assigned to the c0, c1, …, c9 folders.
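The shuffling step can be sketched like this (a minimal stdlib-only sketch; `shuffle_labels` and the directory names are just illustrative — only the c0…c9 folder layout comes from the Statefarm data):

```python
import os
import random
import shutil

def shuffle_labels(src_dir, dst_dir, n_classes=10, seed=42):
    """Copy every image under src_dir/c0..c9 into dst_dir, assigning each
    image to a randomly chosen class folder instead of its real one."""
    rng = random.Random(seed)
    classes = ['c%d' % i for i in range(n_classes)]
    for c in classes:
        os.makedirs(os.path.join(dst_dir, c))
    for c in classes:
        for fname in os.listdir(os.path.join(src_dir, c)):
            target = rng.choice(classes)  # arbitrary label for this image
            shutil.copy(os.path.join(src_dir, c, fname),
                        os.path.join(dst_dir, target, fname))
```

This keeps every image (so class counts stay roughly balanced in expectation) while destroying any relationship between image content and label.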
I trained two models: one on the real training data, and one on the training_arbitrary data.
Model Normal
Epoch 1/1
20220/20220 [==============================] - 285s -
loss: 0.2348 - acc: 0.9412 - val_loss: 0.4316 - val_acc: 0.9446
Model Arbitrary
Epoch 1/1
20220/20220 [==============================] - 310s -
loss: 2.4156 - acc: 0.0994 - val_loss: 2.4057 - val_acc: 0.0690
As you can see, the arbitrary model scores roughly 10% accuracy (chance level for 10 classes), which is what we’d expect from random labels.
So here’s the hypothesis: The arbitrary_model learned to grok the images just as well as the normal_model, so after I fine-tune both of them, I’ll see similar results.
To fine-tune the models, I set all layers before Flatten() to trainable=False, then fit them on the normal training set (no longer using the arbitrary training set).
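In code, the freezing step looks roughly like this. It is written generically (with stand-in layer classes, so it doesn’t depend on a Keras install); with a real model you’d pass `model.layers` and then re-compile before fitting:

```python
def freeze_feature_layers(layers, boundary_name='Flatten'):
    """Set trainable=False on each layer up to and including the first
    layer whose class is named boundary_name; return how many were frozen."""
    frozen = 0
    for layer in layers:
        layer.trainable = False
        frozen += 1
        if type(layer).__name__ == boundary_name:
            break
    return frozen

# Stand-in layer classes, just to demonstrate the behaviour:
class Convolution2D:
    trainable = True
class Flatten:
    trainable = True
class Dense:
    trainable = True

layers = [Convolution2D(), Convolution2D(), Flatten(), Dense(), Dense()]
n = freeze_feature_layers(layers)
print(n)                    # -> 3 (conv, conv, flatten frozen)
print(layers[3].trainable)  # -> True (the dense head stays trainable)
```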
Fine-Tune Model Normal
Epoch 1/1
20220/20220 [==============================] - 284s -
loss: 0.0149 - acc: 0.9984 - val_loss: 0.0283 - val_acc: 0.9936
Fine-Tune Model Arbitrary
Epoch 1/1
20220/20220 [==============================] - 282s -
loss: 0.4738 - acc: 0.8769 - val_loss: 0.1641 - val_acc: 0.9914
Surprisingly, they achieved comparable results!
Okay, so how much of this result is driven purely by the fine-tuned layers? To approximate that, I created a model of just the fine-tuned (post-Flatten) layers and ran it for two epochs, which seemed fairer since the other models got an initial epoch plus a fine-tuning epoch.
Performance of Flattened Layers Only
Epoch 2/2
20220/20220 [==============================] - 282s -
loss: 0.4761 - acc: 0.8825 - val_loss: 0.1667 - val_acc: 0.9741
Not as good! So training a model on arbitrarily labeled data did help, and it helped almost as much as training on correctly labeled data (given that both models are fine-tuned on normal data before measuring performance).
So, I guess here’s what I was hoping to get some feedback on:
- Did I make a mistake that makes this whole line of thinking wrong? That’d be a bummer.
- Even if arbitrary labels kinda work, are pseudo-labels ALWAYS practically better? …Probably…but…
- It seems that gradient descent will cause a network to effectively learn feature extraction (aka grokedness ;)) even with arbitrary labels. If that’s true… then instead of properly calculating the correct gradient (from the correct labels), why not just pick an incorrect gradient (since you’d be calculating it from incorrect labels anyway)? There must be some computational efficiencies to be gained by relieving yourself of properly calculating the gradient each batch. Moreover, if you’re just going to pick an incorrect gradient to use (because forcing a model to converge to anything seems to have value), you could throw unlabeled data at the network just fine. Right? Or have I left the reservation…
- Okay, compromise: still calculate the gradient every so often, but then reuse that calculated gradient over multiple batches like it’s never going out of style. You can just fine-tune the model later anyway.
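To make that compromise concrete, here’s a toy sketch (pure Python, a one-parameter least-squares model rather than a CNN — purely illustrative, not from the experiment above) of recomputing the gradient only every few steps and reusing the stale value in between:

```python
import random

def grad(w, xs, ys):
    # Gradient of mean squared error for the 1-parameter model y ≈ w * x.
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(256)]
ys = [3.0 * x for x in xs]  # true parameter is 3.0

w, lr, refresh = 0.0, 0.05, 4
for step in range(40):
    if step % refresh == 0:  # recompute the gradient only every `refresh` steps
        g = grad(w, xs, ys)
    w -= lr * g              # otherwise reuse the stale gradient

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w)  # ends up close to the true value of 3.0
```

With `refresh=1` this is ordinary gradient descent; on this convex toy problem the stale-gradient version still converges, though on a real network a too-stale gradient can overshoot or point in a badly outdated direction.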
Thanks for the read,
Jon