I think I am missing a piece of the puzzle when it comes to SGD, and I wonder if anyone can point me in the right direction. My naive assumption is that a linear model comprising the last fully connected layer of a network should behave in exactly the same way as a model comprising all the fully connected layers of a network where only the last layer is trainable.
In lesson 2 under the section Train linear model on predictions, subsection Training the model, Jeremy gets the features from the penultimate layer of the CNN and then uses these as input to a linear model which he defines and compiles as
lm = Sequential([ Dense(2, activation='softmax', input_shape=(1000,)) ])
lm.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
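For intuition, the lm above is just a 1000-to-2 softmax regression trained by gradient steps. A minimal NumPy sketch of that same update rule (random stand-in data, not the lesson's actual features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the 1000-dim features fed to lm
X = rng.normal(size=(64, 1000))
true_w = rng.normal(size=(1000, 2))
y = np.eye(2)[np.argmax(X @ true_w, axis=1)]  # one-hot labels

W = np.zeros((1000, 2))
b = np.zeros(2)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(p, y):
    return -np.mean(np.sum(y * np.log(p + 1e-9), axis=1))

lr = 0.01
losses = []
for _ in range(100):
    p = softmax(X @ W + b)
    losses.append(cross_entropy(p, y))
    grad = (p - y) / len(X)           # gradient of softmax + cross-entropy wrt logits
    W -= lr * (X.T @ grad)            # plain SGD step on the weights
    b -= lr * grad.sum(axis=0)

print(losses[0] > losses[-1])  # True: loss falls from ln(2) as the layer trains
```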
This is great and speeds up the fine tuning process enormously.
In this lesson we split the model into conv_model and fc_model to experiment with removing dropout. Instead of removing dropout, I wanted to perform the same experiment as in Lesson 2 with the final fully connected layers; that is, set only the last layer in fc_model to trainable=True. To achieve this I use the approach below, where I copy the initial weights from lm to the last layer of fc_model, on the assumption that both models will then behave in the same way.
tmpModel = Sequential([...])  # same fully connected layers as fc_model (definition truncated)
for l1,l2 in zip(tmpModel.layers, fc_layers): l1.set_weights(l2.get_weights())
tmpModel.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
fc_model = get_fc_model()
for layer in fc_model.layers: layer.trainable=False
fc_model.layers[-1].trainable = True
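To check my assumption in isolation, here is a tiny NumPy sketch (hypothetical layer sizes, no dropout) showing that a frozen prefix plus a trainable last layer yields exactly the same gradient as a standalone linear model fed the precomputed features:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 10))
y = np.eye(2)[rng.integers(0, 2, size=8)]

W1 = rng.normal(size=(10, 5))  # frozen hidden layer (hypothetical sizes)
W2 = rng.normal(size=(5, 2))   # trainable last layer

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Full model with the first layer frozen
h = np.maximum(X @ W1, 0)                # relu activations
p_full = softmax(h @ W2)
g_full = h.T @ (p_full - y) / len(X)     # gradient wrt the last layer only

# Standalone linear model fed the same precomputed features
feats = np.maximum(X @ W1, 0)
p_lin = softmax(feats @ W2)
g_lin = feats.T @ (p_lin - y) / len(X)

print(np.allclose(g_full, g_lin))  # True: identical gradients, so identical SGD steps
```

So with identical weights and no dropout, the two setups should take identical SGD steps, which is why the difference I see below puzzles me.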
If I set opt=RMSprop(lr=0.01) I can train lm using lm.fit; however, unless I reduce the learning rate to 0.001 I cannot train fc_model. By that I mean the accuracy stays around 0.5, from which I infer that I have overshot the minimum by choosing a learning rate that is too large.
If I set opt=SGD(lr=0.1), again I can train lm; however, I have to reduce this to lr=0.001 to get fc_model to train.
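The overshoot I suspect is easy to reproduce on a toy problem. On a 1-D quadratic loss f(w) = 0.5*c*w**2, SGD diverges once lr exceeds 2/c, so the same model is trainable or untrainable depending purely on the learning rate relative to the loss curvature (toy values, not my actual models):

```python
# SGD on f(w) = 0.5 * c * w**2; the update is w -= lr * c * w,
# which diverges whenever lr > 2/c (here 2/c = 0.2).
def run(lr, c=10.0, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * c * w
    return abs(w)

print(run(0.01) < 1.0)  # True: small lr converges toward 0
print(run(0.3) > 1.0)   # True: lr above 2/c makes |w| blow up
```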
What am I missing?