Training loss jumps after making more layers trainable

(Alex) #1

Hi all,

I’m trying to resolve a puzzler. I’m following v1 of the course; I’m on lesson 2. I am trying to train all the dense layers after finetuning the model. Surprisingly, the loss becomes catastrophically worse after I compile the new model (goes from 1.0 to 15.0 on a State Farm 10 class classification problem). Here’s the detailed breakdown of what I do:

  • I load the VGG model.
  • I use vgg.finetune() to finetune the last layer and switch it to 10 classes.
  • I train on a small sample set.
    • The loss starts high (~6) and over 30 epochs goes down to ~1.
  • I run predictions on a small test set.
  • I modify the model to make all dense layers trainable.
  • I run predictions on the same small test set, for later comparison.
  • I compile the model.
  • I run predictions on the same small test set, for later comparison. These three runs should have the same results.
  • I train the model on the sample set.
    • The loss goes to ~15
    • The loss never improves, over 30 epochs.
  • I run predictions on the same small test set again.
    • The first three sets of predictions match (i.e. they predict the same classes). This is expected, since the model hasn’t changed.
    • The fourth, computed after the model is retrained, only ever predicts one category, always the same (‘c3’).
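The freeze/unfreeze/recompile sequence in the steps above can be sketched with a tiny stand-in model (the layer names and shapes here are illustrative, not the actual course VGG16): changing `trainable` flags only takes effect in training after the model is recompiled.

```python
# Minimal sketch of the unfreeze-then-recompile sequence, using a small
# Sequential model in place of VGG16 (assumed: tensorflow.keras is available).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu", name="dense_1"),
    layers.Dense(8, activation="relu", name="dense_2"),
    layers.Dense(10, activation="softmax", name="predictions"),
])

# Freeze everything except the final layer, roughly what vgg.finetune() does.
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer="adam", loss="categorical_crossentropy")
frozen_count = len(model.trainable_weights)  # just the last layer: kernel + bias

# Unfreeze all dense layers, then recompile so the change takes effect.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer="adam", loss="categorical_crossentropy")
unfrozen_count = len(model.trainable_weights)  # all three layers now

print(frozen_count, unfrozen_count)  # 2 vs 6
```

Counting `model.trainable_weights` before and after the second `compile()` is a quick way to confirm which layers training will actually update.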

If anyone is interested, the Jupyter notebook is at

The first cell can be ignored (I believe…); it’s mostly setup. It’s the following cells that actually do the work.

One thing I tried is not calling compile(), which seems to work, but emits a warning about the trainable parameter count not matching. I’m pretty sure that means I’m still training only the old configuration (just the last layer), which would explain why training continues to behave correctly.
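One sanity check worth noting: recompiling only rebuilds the training setup; it doesn’t touch the weights, which is why the three prediction runs before retraining should all match. A minimal sketch (again with a tiny stand-in model, not the actual VGG16):

```python
# Sketch: compile() does not modify weights, so predictions before and
# after recompiling are identical (assumed: tensorflow.keras, numpy).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

x = np.random.rand(3, 4).astype("float32")
before = model.predict(x, verbose=0)

# Flip trainable flags and recompile, as in the steps above.
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer="adam", loss="categorical_crossentropy")
after = model.predict(x, verbose=0)

assert np.allclose(before, after)  # weights untouched by compile()
```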

Any help is appreciated.

(Alex) #2

In case anyone comes across this:

Apparently, for a problem like the State Farm dataset, where a few epochs don’t come close to correctly classifying everything in the test set, let alone overfitting, switching to training multiple layers that early leads to instability. The predictions are still far from right, so the gradients flowing into the newly unfrozen layers point in all sorts of wrong directions; it’s almost as if we hadn’t finetuned at all.

I realized this by switching to training only the last two dense layers, which behaved much better but still worse than expected: the loss initially jumped to 5, and then settled back down to where it had been after 30 epochs of finetuning.
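The milder approach above can be sketched like this: unfreeze only the last two dense layers and use a smaller learning rate, so large early gradients don’t wreck the finetuned weights. The model, layer count, and the 1e-5 learning rate here are illustrative assumptions, not the course’s exact settings.

```python
# Sketch: unfreeze only the last two dense layers and lower the learning
# rate before recompiling (assumed: tensorflow.keras; toy model shapes).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

for layer in model.layers:
    layer.trainable = False
for layer in model.layers[-2:]:   # only the last two dense layers
    layer.trainable = True

# A reduced learning rate (e.g. 1e-5 instead of Adam's default 1e-3)
# keeps the unfrozen layers from moving too far on noisy early gradients.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),
              loss="categorical_crossentropy")

print(len(model.trainable_weights))  # 4: kernel + bias for two layers
```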

Hope this helps anyone else who may run into something similar.