Model learns on the first fit, but not when the code is re-run

Hi! I am trying to apply what I’ve learned from fast.ai to a program of my own, and I am facing a strange issue: when I run my fit code, both the training accuracy and the validation accuracy increase over the epochs, but when I re-run the same code, the training accuracy changes only very slightly and the val_acc does not increase at all.

These results really puzzle me; any help would be much appreciated!

Note: I interrupted the kernel once I could see where the results were heading, so the two runs do not cover the same number of epochs.

Here are the code and the results:

1st run

    # Freeze the lower layers (everything up to 14 layers before the
    # last conv layer) - their weights will not be updated
    last_conv_idx = [i for i, l in enumerate(model.layers) if type(l) is Convolution2D][-1]
    conv_layers = model.layers[:last_conv_idx + 1]
    for layer in model.layers[:(last_conv_idx - 14)]:
        layer.trainable = False

    model.compile(optimizers.SGD(lr=1e-5, momentum=0.9),  # reduced learning rate
                  loss='categorical_crossentropy', metrics=['accuracy'])

    fit_model(model, batches, val_batches, nb_epoch=50)

    Epoch 1/50
    979/979 [==============================] - 58s - loss: 8.4502 - acc: 0.3636 - val_loss: 7.5876 - val_acc: 0.3684
    Epoch 2/50
    979/979 [==============================] - 57s - loss: 7.1504 - acc: 0.4290 - val_loss: 6.6250 - val_acc: 0.4094
    Epoch 3/50
    979/979 [==============================] - 58s - loss: 5.4189 - acc: 0.5240 - val_loss: 5.9225 - val_acc: 0.4035
    Epoch 4/50
    979/979 [==============================] - 58s - loss: 4.2865 - acc: 0.5853 - val_loss: 4.6531 - val_acc: 0.4737
    Epoch 5/50
    979/979 [==============================] - 58s - loss: 3.3071 - acc: 0.6425 - val_loss: 4.2554 - val_acc: 0.5088
    Epoch 6/50
    979/979 [==============================] - 59s - loss: 2.6164 - acc: 0.6782 - val_loss: 3.7746 - val_acc: 0.5205
    Epoch 7/50
    979/979 [==============================] - 58s - loss: 1.9839 - acc: 0.7150 - val_loss: 3.4954 - val_acc: 0.4971
    Epoch 8/50
    979/979 [==============================] - 59s - loss: 1.6168 - acc: 0.7375 - val_loss: 3.3142 - val_acc: 0.5029
    Epoch 9/50
    979/979 [==============================] - 59s - loss: 1.2860 - acc: 0.7896 - val_loss: 3.1012 - val_acc: 0.4971
    Epoch 10/50
    979/979 [==============================] - 59s - loss: 1.1350 - acc: 0.7978 - val_loss: 3.0394 - val_acc: 0.4795
    Epoch 11/50
    979/979 [==============================] - 58s - loss: 0.8891 - acc: 0.8304 - val_loss: 2.9810 - val_acc: 0.4737
    Epoch 12/50
    979/979 [==============================] - 59s - loss: 0.7845 - acc: 0.8386 - val_loss: 2.9043 - val_acc: 0.5088
    Epoch 13/50
    979/979 [==============================] - 59s - loss: 0.6385 - acc: 0.8570 - val_loss: 2.8198 - val_acc: 0.5088
    Epoch 14/50
    979/979 [==============================] - 59s - loss: 0.4994 - acc: 0.8805 - val_loss: 2.7578 - val_acc: 0.4971
    Epoch 15/50
    979/979 [==============================] - 59s - loss: 0.4694 - acc: 0.8907 - val_loss: 2.7359 - val_acc: 0.5146
    Epoch 16/50
    979/979 [==============================] - 59s - loss: 0.3940 - acc: 0.8979 - val_loss: 2.7305 - val_acc: 0.5029
    Epoch 17/50
    979/979 [==============================] - 59s - loss: 0.3737 - acc: 0.9070 - val_loss: 2.7277 - val_acc: 0.5029
    Epoch 18/50
    979/979 [==============================] - 59s - loss: 0.3595 - acc: 0.9183 - val_loss: 2.7023 - val_acc: 0.5029
    Epoch 19/50
    384/979 [==========>...................] - ETA: 32s - loss: 0.2784 - acc: 0.9036

2nd run

    # Freeze the lower layers (everything up to 14 layers before the
    # last conv layer) - their weights will not be updated
    last_conv_idx = [i for i, l in enumerate(model.layers) if type(l) is Convolution2D][-1]
    conv_layers = model.layers[:last_conv_idx + 1]
    for layer in model.layers[:(last_conv_idx - 14)]:
        layer.trainable = False

    model.compile(optimizers.SGD(lr=1e-5, momentum=0.9),  # reduced learning rate
                  loss='categorical_crossentropy', metrics=['accuracy'])

    fit_model(model, batches, val_batches, nb_epoch=50)

    Epoch 1/50
    979/979 [==============================] - 57s - loss: 1.0972 - acc: 0.4065 - val_loss: 1.0974 - val_acc: 0.3977
    Epoch 2/50
    979/979 [==============================] - 56s - loss: 1.0972 - acc: 0.3882 - val_loss: 1.0974 - val_acc: 0.3977
    Epoch 3/50
    979/979 [==============================] - 57s - loss: 1.0974 - acc: 0.3871 - val_loss: 1.0974 - val_acc: 0.3977
    Epoch 4/50
    979/979 [==============================] - 57s - loss: 1.0973 - acc: 0.3973 - val_loss: 1.0974 - val_acc: 0.3977
    Epoch 5/50
    979/979 [==============================] - 57s - loss: 1.0972 - acc: 0.4035 - val_loss: 1.0974 - val_acc: 0.3977
    Epoch 6/50
    832/979 [========================>.....] - ETA: 7s - loss: 1.0975 - acc: 0.3858

Something I’ve noticed about Keras is that when you interrupt training and then try to train again, the model sometimes does not want to learn any more. I am not sure why this is; perhaps there is some state that needs to be reset (the optimizer’s, for example). But it seems like that is what happened here. (To fix it, create a new instance of the model and train that; see the sketch below.)
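For example, here is a minimal sketch of that workaround. It assumes a hypothetical create_model() helper (not from this thread) that rebuilds the architecture and reloads the original pre-trained weights; the freezing and compile steps are the same ones used above:

    from keras import optimizers
    from keras.layers import Convolution2D

    # create_model() is a hypothetical helper: it should rebuild the
    # architecture and reload the original pre-trained weights
    model = create_model()

    # Re-apply the layer freezing from the original code
    last_conv_idx = [i for i, l in enumerate(model.layers)
                     if type(l) is Convolution2D][-1]
    for layer in model.layers[:(last_conv_idx - 14)]:
        layer.trainable = False

    # Compiling the fresh model also gives you a fresh optimizer,
    # so any stale state (momentum buffers etc.) is gone
    model.compile(optimizers.SGD(lr=1e-5, momentum=0.9),
                  loss='categorical_crossentropy', metrics=['accuracy'])

    fit_model(model, batches, val_batches, nb_epoch=50)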


Yeah, it’s super strange… I’ll reset and try again, but if someone knows how to get training working again without restarting the instance (which is quite a pain), I am open to suggestions.

Thanks a lot!

Just tried restarting the kernel and it still does not learn… It gets stuck at val_acc = 0.3977.

If you do model.summary(), how many trainable parameters does it have? Maybe you’re setting too many layers to trainable = False.
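For reference, a quick way to check (standard Keras calls, nothing specific to this thread):

    model.summary()  # prints per-layer output shapes plus total/trainable param counts

    # Count the layers that survived the freezing loop
    n_trainable = sum(1 for l in model.layers if l.trainable)
    print('%d of %d layers are trainable' % (n_trainable, len(model.layers)))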

    Total params: 23,104,323
    Trainable params: 21,958,915
    Non-trainable params: 1,145,408