Discriminative learning rates when training from scratch

When training a deep learning model from scratch, would it still be useful to use discriminative learning rates?

I wouldn’t think so. The idea behind discriminative learning rates is that earlier layers learn basic, common features that don’t need to be adjusted much and can therefore use a lower learning rate. Here, though, we have a new model that hasn’t learned anything yet, so if anything discriminative learning rates would just slow down training of the earlier layers?

Am I thinking about this correctly?
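
For what it’s worth, here is a minimal sketch (plain PyTorch, with a made-up toy model and made-up values, not anything from fastai) of what discriminative learning rates mean mechanically: each group of layers gets its own learning rate, with the earliest layers receiving the smallest updates.

import torch
import torch.nn as nn

# Toy model standing in for a real network; the only point is that
# parameters are split into groups by depth.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # "early" layers
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),  # "later" layers
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),                           # head
)

# Discriminative learning rates: one learning rate per parameter group,
# smallest for the earliest layers. (ReLU/pool/flatten have no parameters.)
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-6},  # early layers: barely updated
    {"params": model[2].parameters(), "lr": 1e-4},
    {"params": model[6].parameters(), "lr": 1e-2},  # head: updated the most
])

The question is whether that 1e-6 group makes any sense when its weights are still random.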

Bumping this up because I think this question should get an answer, or at least some discussion regarding best practices for using discriminative learning rates.

My intuition is that discriminative learning rates would only really work if you have sensible weights for your early layers. When you start with a randomly initialized network, this is not the case.

We can plot the activations from a network and see this for ourselves. If you want more information about the plots, see: https://github.com/JoshVarty/VisualizingActivations/blob/master/VisualizingActivations.ipynb

Below we train a ResNet-18 network with discriminative learning rates:

learn = cnn_learner(data, models.resnet18, pretrained=False, metrics=[f_score])
learn.unfreeze()
learn.fit_one_cycle(10, max_lr=slice(1e-6,1e-2))
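
(As I understand it, max_lr=slice(1e-6, 1e-2) spreads the learning rates geometrically across the learner’s layer groups, so the earliest group trains at roughly 1e-6 and the last at 1e-2. A quick numeric illustration of that assumed spread for three layer groups; this is not fastai’s actual code:)

import numpy as np

# Assumed behaviour of max_lr=slice(1e-6, 1e-2): learning rates spaced
# evenly on a log scale from the first layer group to the last.
start, stop, n_groups = 1e-6, 1e-2, 3
print(np.geomspace(start, stop, n_groups))  # ~ [1e-06, 1e-04, 1e-02]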

We can plot the activations from each of conv1, conv2_x, conv3_x, conv4_x, and conv5_x (a rough sketch of how such plots can be produced follows the list). In the plots below:

  • The x-axis represents time, as we train the network
  • The y-axis represents the magnitude of activations.
    • More yellow means more activations at a given magnitude.
    • More blue means fewer activations at a given magnitude.
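
As mentioned above, here is a rough sketch of how plots like these can be produced; this is not the notebook’s exact code, the layer names follow torchvision’s ResNet-18, and the histogram bin range is arbitrary. The idea is to register a forward hook on each block of interest and record a histogram of its activations at every forward pass.

import torch
from torchvision import models

model = models.resnet18(pretrained=False)
layer_names = ["conv1", "layer1", "layer2", "layer3", "layer4"]
stats = {name: [] for name in layer_names}

def make_hook(name):
    def hook(module, inputs, output):
        # record a histogram of this block's output at every forward pass
        stats[name].append(torch.histc(output.detach().float().cpu(),
                                       bins=40, min=-2, max=10))
    return hook

handles = [getattr(model, name).register_forward_hook(make_hook(name))
           for name in layer_names]

# ...train as usual; torch.stack(stats["conv1"], dim=1) then gives a
# bins-by-steps array that can be shown as an image over training time.
# Afterwards, remove the hooks:
# for h in handles: h.remove()

With a cnn_learner the body is wrapped in a Sequential, so you would likely need to index into learn.model[0] rather than use the attribute names above.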

Visualizing the activations:

The first layer doesn’t appear to change much as training progresses. This is likely because we’re using such a small learning rate.

If we instead use:

learn.fit_one_cycle(10, max_lr=(1e-2))

We can plot our activations again:

This looks much better! It also seems to improve results: my F1 score went from 0.238104 to 0.468753, with a corresponding improvement in loss.

Thanks for the interesting visualizations. This is what I thought, but I see so many people (especially on Kaggle) using discriminative learning rates for models that are not pretrained and getting good results. I need to try to get even better results without discriminative learning rates! 🙂

Yeah, I saw that too. I think they (like myself) are using them out of habit. Most Kaggle competitions don’t forbid pretrained networks, so it’s easy to get into the habit of always using discriminative learning rates.

My intuition could be completely wrong, but it is that the effect of a weight change depends on the depth of the layer in which it is made. Butterfly wings may cause a storm 10,000 km away but not 10 m away. I’d like a more reasoned answer as well.

So you think that discriminative learning rates could be useful even when training from scratch?

It could be interesting to ‘bleed in’ discriminative learning rates during a from-scratch one_cycle training.
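
Purely as a hypothetical sketch of what “bleeding in” might look like (the group structure, schedule shape, and numbers are all made up, and optimizer is assumed to be an optimizer with one param group per layer group, as in the sketch earlier in the thread): start all groups at the same rate and gradually widen the spread between early and late groups as training progresses.

import numpy as np

# Hypothetical schedule: all groups share base_lr at the start, and the
# spread between the earliest and latest groups widens as training goes on.
n_epochs, n_groups, base_lr = 10, 3, 1e-2
full_spread = np.geomspace(1e-4, 1.0, n_groups)   # relative multipliers at full strength

def group_lrs(epoch):
    blend = epoch / (n_epochs - 1)        # 0 = uniform rates, 1 = full spread
    return base_lr * full_spread ** blend

for epoch in range(n_epochs):
    # `optimizer` is assumed to have one param group per layer group,
    # ordered from earliest layers to head.
    for lr, group in zip(group_lrs(epoch), optimizer.param_groups):
        group["lr"] = lr
    # ...run one epoch of training here; a one-cycle schedule would then
    # modulate these per-group rates within the epoch.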