What’s going on with fine-tuning again?

(ben.bowles) #1

After we define the VGG model with Keras, we load the pre-trained weights into the model we just defined. VGG was trained on the entire ImageNet database, right? But we are dealing with dogs / cats here, so we want to improve the performance of the model specifically for dogs / cats.

So what’s the difference between fine-tuning and fitting? The fitting part looks roughly like what I’d expect, but the fine-tuning part I haven’t seen before.

This is what I am guessing (Jeremy / Rachel, please tell me if I am right). We are taking everything we can from the VGG model, up until the last layer, because we want the nice abstract / high-level / object-like image features from the late layers of the network, but we want to output different categories (now dogs / cats; before, 1000 different object categories).

So we are, in a sense, modifying the VGG model: we take the high-level VGG network, do no further training on it, and instead train a one-layer neural network that maps the pretrained VGG features to a dogs / cats output.
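A minimal pure-Python sketch of that idea, not the course's Keras code: a fixed feature extractor stands in for the pretrained network, and only a small logistic output layer is trained on top of it. The names FROZEN_W, frozen_features and predict, the toy data, and the learning rate are all assumptions made up for illustration.

```python
import math
import random

random.seed(0)

# Frozen "pretrained" weights: a stand-in for the layers we do not train.
FROZEN_W = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 features from 2 inputs


def frozen_features(x):
    """Fixed feature extractor (ReLU of a linear map); never updated."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def predict(weights, bias, feats):
    """New trainable output layer: logistic regression on the features."""
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, data):
    """Average binary cross-entropy over the data set."""
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, frozen_features(x))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)


# Toy two-class data: label 1 when the first input exceeds the second.
data = []
for _ in range(50):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    data.append((x, 1 if x[0] > x[1] else 0))

weights, bias = [0.0, 0.0, 0.0], 0.0
frozen_before = [row[:] for row in FROZEN_W]
loss_before = mean_loss(weights, bias, data)

for _ in range(200):  # plain SGD on the output layer only
    for x, y in data:
        feats = frozen_features(x)
        err = predict(weights, bias, feats) - y  # dLoss/dz for cross-entropy
        weights = [w - 0.1 * err * f for w, f in zip(weights, feats)]
        bias -= 0.1 * err

loss_after = mean_loss(weights, bias, data)
```

The loss drops even though FROZEN_W never changes: all of the learning happens in the new output layer, which only has to re-map features that already exist.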

If this is correct, then, the very fast performance per epoch actually makes more sense now.
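A rough back-of-the-envelope check of why so little is being trained, assuming the layer feeding the output has 4096 units (the usual VGG16 width) and that the original 1000-way output layer is the one being replaced. The helper dense_params is hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


old_output = dense_params(4096, 1000)  # original 1000-way ImageNet layer
new_output = dense_params(4096, 2)     # replacement dogs / cats layer

print(old_output)  # 4097000
print(new_output)  # 8194
```

Only those few thousand parameters receive updates. The frozen layers still have to run a forward pass on every image, so the saving is in the weight updates rather than in the whole computation.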

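from keras.layers.core import Dense
from keras.optimizers import RMSprop

def finetune(self, batches):
    model = self.model
    model.pop()  # remove the original 1000-way output layer
    for layer in model.layers:
        layer.trainable = False  # freeze the remaining pretrained layers
    model.add(Dense(batches.nb_class, activation='softmax', input_shape=(1000,)))
    # Recompile so that trainable=False takes effect.
    # The optimizer and learning rate are an assumption; that part of the call is truncated.
    model.compile(optimizer=RMSprop(lr=0.1),
                  loss='categorical_crossentropy', metrics=['accuracy'])

def fit(self, batches, val_batches, nb_epoch=1):
    self.model.fit_generator(batches, samples_per_epoch=batches.nb_sample, nb_epoch=nb_epoch,
                             validation_data=val_batches, nb_val_samples=val_batches.nb_sample)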

(Jeremy Howard) #2

@ben.bowles exactly! The definition of finetune() you show makes it clear what’s going on. We’ll be studying this in detail tomorrow :slight_smile:

(Jeremy Howard) #3

BTW @ben.bowles it would be a wonderful help for all if you were to take your findings (including the code samples) and create a page on the wiki describing what you’ve learnt about fine tuning and how it works…

(Tom Elliot) #4

So finetuning a second time would stack another output layer on, so we probably want to make sure we only finetune an instantiation of the VGG class once?

What happens if you stack a bunch of extra layers on the back? Does it just take extra time to run through the model, or does it result in some kind of over specificity for the model?

(Jeremy Howard) #5

@tom actually it removes the last layer, and then adds a new layer. So you can do it many times - it has no impact beyond doing it once.
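A toy sketch of why repeated calls do not stack layers, assuming finetune pops the current output layer before adding a new one. ToyModel is a hypothetical stand-in that tracks only layer names, not a Keras class.

```python
class ToyModel:
    """Hypothetical stand-in for a Sequential model: just a list of layer names."""

    def __init__(self, layers):
        self.layers = list(layers)

    def finetune(self, n_classes):
        self.layers.pop()  # drop the current output layer
        self.layers.append("dense_%d" % n_classes)  # attach a fresh one


model = ToyModel(["conv_block", "fc1", "fc2", "dense_1000"])
model.finetune(2)
depth_after_first = len(model.layers)
model.finetune(2)
model.finetune(2)
depth_after_third = len(model.layers)
```

The depth is the same after one call or three. In a real model the replacement layer would presumably start from freshly initialised weights each time, so any training done on the previous output layer would be discarded.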

We’ll learn about adding extra layers later in class.

(ben.bowles) #6

Thanks @jeremy. Yes, perhaps I will create a page on the wiki illustrating the basic concept. Good suggestion!

(Tom Elliot) #7

I’ve started a wiki page here:

Could definitely use some polishing!

(Jeremy Howard) #8

It’s a great start :slight_smile:

(Swathi Shyam Sunder) #9

@ben.bowles, @tom - Thanks for this thread and the wiki.

So as per the wiki on Fine Tuning,

Fine tuning is a process to take a network model that has already been trained for a given task, and make it perform a second similar task.
Assuming the original task is similar to the new task, using a network that has already been designed & trained allows us to take advantage of the feature extraction that happens in the front layers of the network without developing that feature extraction network from scratch. Fine tuning replaces the output layer (layers?), which are capable of recognising and classifying higher level features. The new output layer (layers?) that are attached to the model are then “trained” to take the lower level features from the front of the network and map them to the desired output classes.

here is what I get.

With respect to the cats vs dogs example,

  • The original task would be classifying the images into the 1000 ImageNet categories.

  • The new task would be to classify the images into just 2 categories i.e., cats or dogs.

  • From the definition of finetune, the last layer is removed/popped.

  • trainable is set to false for all other lower layers as they have already been trained (as part of the original task)

  • For trainable = False to take effect, the model needs to be compiled again, as per this; that compile call is the last line in the function.

  • Here, we are adding our own new layer - model.add(Dense(batches.nb_class, activation='softmax', input_shape=(1000,))). From the documentation, I understand the first two parameters. However, I am still not clear about input_shape. The docs say the format should be (nb_samples, input_dim). We are passing 1000 for nb_samples, but why are we not passing anything for input_dim?

Also, in the current version of vgg16.py, the finetune function is a bit different, i.e., the input_shape parameter is not passed to Dense at all.

What is the significance of this parameter and what is the reason for removing it?

(Jeremy Howard) #10

Great summary! The input_shape parameter is only relevant for the very first layer of a network - it should never have been included in the finetune function; it was a copy/paste error. The finetune function is adding a layer to an existing network, so that parameter is ignored (the input shape is derived by looking at the output shape of the previous layer).
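A toy illustration of that shape chaining, not Keras internals: ToyDense and infer_shapes are hypothetical, and the layer widths are only illustrative. Note that in Keras input_shape excludes the sample (batch) axis, so (1000,) would mean 1000 features per sample rather than 1000 samples.

```python
class ToyDense:
    """Hypothetical dense layer that records only its output width."""

    def __init__(self, units, input_shape=None):
        self.units = units
        self.declared_input_shape = input_shape  # only meaningful on layer 0


def infer_shapes(layers):
    """Return (input_shape, output_shape) per layer, chaining outputs forward."""
    current = layers[0].declared_input_shape
    if current is None:
        raise ValueError("the first layer must declare input_shape")
    shapes = []
    for layer in layers:
        out = (layer.units,)
        shapes.append((current, out))
        current = out  # the next layer's input is this layer's output
    return shapes


layers = [
    ToyDense(4096, input_shape=(25088,)),  # first layer: declaration is used
    ToyDense(4096),
    ToyDense(2, input_shape=(1000,)),      # stray argument: ignored
]
shapes = infer_shapes(layers)
```

The last layer ends up with an input shape of (4096,), taken from the layer before it, regardless of the (1000,) it was given.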

(Jeremy Howard) #11

I’ve updated the wiki from @tom with some minor clarifications, and added the details provided by @swathi.ssunder . Many thanks to you both.

(Swathi Shyam Sunder) #12

Now that I think about your reply and the spreadsheet example that you explained in class, it makes sense that input_shape is not passed.
Thanks @jeremy.

(janardhanp22) #13

Wondering if the line below can be rephrased? @jeremy

trainable is set to false for all other lower layers as they have already been trained (as part of the original task)

New :
trainable is set to false for all other hidden layers as they have already been trained (as part of the original task)

(Jeremy Howard) #14

Sorry @janardhanp22 I don’t understand the question. Can you please provide full context?

(janardhanp22) #15

@Jeremy Sorry for the lack of clarity. I meant that when we read the wiki page http://wiki.fast.ai/index.php/Fine_tuning
and consider the penultimate line, it could be rephrased/edited from “lower layers” to “hidden layers”.

(Jeremy Howard) #16

Thanks for the context.

No, that wouldn’t be correct. It’s not all hidden layers that are set to trainable=False, but just some that are earlier in the network where you don’t want to finetune them.
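A small sketch of the distinction, using a hypothetical ToyLayer with only a name and a trainable flag (the layer names and the cut-off index are made up): a hidden layer can stay trainable while the earlier ones are frozen.

```python
class ToyLayer:
    """Hypothetical layer with just a name and a trainable flag."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


def freeze_before(layers, first_trainable_index):
    """Freeze layers before the given index; leave the rest trainable."""
    for i, layer in enumerate(layers):
        layer.trainable = i >= first_trainable_index


layers = [ToyLayer(n) for n in ["conv1", "conv2", "fc1", "fc2", "output"]]
freeze_before(layers, 3)  # fine-tune fc2 and the output layer only
trainable_names = [layer.name for layer in layers if layer.trainable]
frozen_names = [layer.name for layer in layers if not layer.trainable]
```

Here fc2 is a hidden layer but is still trainable, which is why “lower layers” and “hidden layers” are not interchangeable.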

(carlos roberto) #17

Questions about fine-tuning and fit.
Considering lesson 1, what I had understood was that when we fit the model, we are adjusting (fitting) 1000 classes to 2 classes (I also see this as a kind of tuning), and then when we finetune, we are trying to improve the predictions of our model.
So, when I read the comments above, what I understand is that the reduction of classes occurs in fine-tuning and not in the fit process. What am I missing here?


I am trying to understand the limitations of transfer learning (fine-tuning). From this reference, http://cs231n.github.io/transfer-learning/, I understand that transfer learning will work if the new data set is large, even if the new data set is entirely different from the original data set.

My question is: if a pre-trained model is fine-tuned on an entirely different data set, and trainable = False for all inner layers of the network, then how does it learn about the new image data set?

Can you please clarify?
Thank you.

(Angel) #19

@Chandrak I don’t read the cs231n recommendation as you do: point 4 says that if the dataset is large and different from the original, it is worth trying to retrain the whole ConvNet.
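As a rough paraphrase of the four scenarios in the cs231n transfer-learning notes (the function name, argument names, and return strings are made up here, and the notes themselves carry more nuance), the choice could be sketched as a lookup:

```python
def suggest_strategy(dataset_is_large, similar_to_original):
    """Loose paraphrase of the four transfer-learning scenarios in the cs231n notes."""
    if similar_to_original:
        if dataset_is_large:
            return "fine-tune through the full network"
        return "train a linear classifier on top of the frozen features"
    if dataset_is_large:
        return "fine-tune the whole ConvNet (or train it from scratch)"
    return "train a linear classifier on activations from an earlier layer"


# Dogs vs cats, treated here as small and similar to ImageNet (an assumption).
dogs_vs_cats = suggest_strategy(dataset_is_large=False, similar_to_original=True)
# Point 4: large and different from the original data.
large_and_different = suggest_strategy(dataset_is_large=True, similar_to_original=False)
```

Under this reading, freezing every inner layer corresponds to the small-and-similar case; for a large, different data set the inner layers would be left trainable.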



You are right. I re-read the cs231n reference. I have a follow-up question on this.

In the code below, when fine-tuning is done, all inner layers are set to trainable = False. In the next step, the model is trained on the data.

 batches = vgg.get_batches(path+'train', batch_size=batch_size)
 val_batches = vgg.get_batches(path+'valid', batch_size=batch_size*2)
 vgg.fit(batches, val_batches, nb_epoch=1)

If the new data set is different from the original data set, then how will the line of code below help in training (given that trainable = False for the inner layers)?


Please let me know.