nn.Parameter always has zero gradient

I’m trying to create a module to replace input dropout in AWD_LSTM. Here is my code:

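import torch
from torch import nn

class LearnRNNDropout(nn.Module):
  "RNN input dropout whose mask is multiplied by a learnable scale."
  def __init__(self, p=0.5):
    super().__init__()  # must run before an nn.Parameter can be assigned
    self.p = p
    self.scale = nn.Parameter(torch.tensor([1.5]))
  def forward(self, x):
    if not self.training or self.p == 0.: return x
    # one mask value per (dim 0, dim 2) position, broadcast along dim 1
    sz = (x.size(0), 1, x.size(2))
    m = x.data.new(*sz).bernoulli_(1-self.p).div_(1-self.p)
    return x * (m * self.scale)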

self.scale is supposed to give the network a way to scale the dropout matrix. I initialized it with 1.5 because that is clearly non-optimal, but self.scale.grad is tensor([0.], device='cuda:0', dtype=torch.float16) for the first 5 batches of training (after which I stop early).

Why would this self.scale parameter not get updated, i.e. never get a gradient other than 0?

I ran into the same problem. It comes from the fact that fastai looks for modules, not parameters, when it creates the parameter groups for the optimizer. So your parameter never ends up in a trainable group and won’t be trained. Fastai added a ParameterModule layer in fastai.layers; you can use it by doing self.scale = ParameterModule(nn.Parameter(tensor([1.5]))). It should be able to train now.

Thanks so much for your reply. If I do as you suggest, I can no longer multiply with self.scale. Do you happen to know how I would use the ParameterModule in this case?

So I got this to “work” by doing return x * (m * self.scale.val) in the forward method, but scale.val.grad is still 0 throughout training.
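For reference, a minimal sketch of such a wrapper in plain PyTorch, assuming it does nothing more than store the parameter under a val attribute. ParamWrapper and ScaledInput are hypothetical names used for illustration; this is not fastai's implementation.

```python
import torch
from torch import nn

class ParamWrapper(nn.Module):
    "Hypothetical stand-in: holds a lone parameter so it is reachable as a module."
    def __init__(self, p: nn.Parameter):
        super().__init__()
        self.val = p              # registered as a parameter of this module
    def forward(self, x):
        return x                  # identity; the module only carries `val`

class ScaledInput(nn.Module):
    "Hypothetical toy layer that multiplies its input by a learnable scale."
    def __init__(self):
        super().__init__()
        self.scale = ParamWrapper(nn.Parameter(torch.tensor([1.5])))
    def forward(self, x):
        return x * self.scale.val  # multiply with .val, not with the module

layer = ScaledInput()
layer(torch.ones(2, 3)).sum().backward()
param_names = [name for name, _ in layer.named_parameters()]
scale_grad = layer.scale.val.grad  # d/d(scale) of sum(x * scale) = sum(x)
```

The wrapped parameter shows up under the name scale.val and receives a gradient like any other parameter, so a zero gradient would have to come from somewhere else in the pipeline.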

I indeed forgot to mention that you have to use the val attribute. I’m not sure why it doesn’t work for you; sometimes it comes from the way the learner is created. By the way, I noticed that in theory it should work directly with parameters in the standard pipeline, but using a module is safer anyway. How do you create your learner? Does your parameter appear in learn.layer_groups? And in learn.opt.param_groups?
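One way to run that kind of check in plain PyTorch, assuming you can get hold of the underlying torch.optim optimizer: compare the model's parameters by identity against what sits in param_groups. missing_from_optimizer and Toy are hypothetical names; how to reach the wrapped optimizer from a fastai learner is not covered here.

```python
import torch
from torch import nn

def missing_from_optimizer(model: nn.Module, opt: torch.optim.Optimizer):
    "Names of trainable parameters of `model` that `opt` will never update."
    in_opt = {id(p) for group in opt.param_groups for p in group["params"]}
    return [name for name, p in model.named_parameters()
            if p.requires_grad and id(p) not in in_opt]

class Toy(nn.Module):
    "Hypothetical toy model with one layer and one lone parameter."
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(4, 2)
        self.scale = nn.Parameter(torch.tensor([1.5]))
    def forward(self, x):
        return self.lin(x) * self.scale

model = Toy()
# Optimizer built from a submodule only, so `scale` is left out.
opt = torch.optim.SGD(model.lin.parameters(), lr=0.1)
missing = missing_from_optimizer(model, opt)
```

Anything listed in missing still takes part in the forward pass, but no optimizer step will ever change it.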

learn.opt is not defined and I cannot find it in learn.layer_groups. If you want to take a look at the notebook yourself, I put it here. See section “RNNDropout w/ learnable scaling”. Thanks a lot for your help!

Oh it’s learn.opt_func, not learn.opt, sorry for the mistake. I’ll look into your notebook to try and understand what happens.

I think the problem comes from the split function you are using. Fastai uses the layer_groups attribute to feed the optimizer, and that attribute is created from the split function you pass in learn = LanguageLearner(data, model, split_func=config['split_lm'], **learn_kwargs) (in the function lm_learner). The problem is that this function doesn’t take your parameter into account:

def awd_lstm_lm_split(model:nn.Module) -> List[nn.Module]:
    "Split a RNN `model` in groups for differential learning rates."
    groups = [[rnn, dp] for rnn, dp in zip(model[0].rnns, model[0].hidden_dps)]
    return groups + [[model[0].encoder, model[0].encoder_dp, model[1]]]

Try not passing split_func=config['split_lm'] to LanguageLearner and check learn.layer_groups. In theory you should see all layers there, including your parameter. If you want to use discriminative layer training, I think you’ll need to write your own split function that includes it.
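A toy illustration of that filtering effect in plain PyTorch (not fastai's code): the optimizer is built from the groups a split function returns, so a module left out of every group is never updated. LearnScale, TinyModel, split_without_dp, split_with_dp and opt_from_groups are all hypothetical names.

```python
import torch
from torch import nn

class LearnScale(nn.Module):
    "Toy stand-in for the dropout layer with a learnable scale."
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor([1.5]))
    def forward(self, x):
        return x * self.scale

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_dp = LearnScale()
        self.lin = nn.Linear(3, 1)
    def forward(self, x):
        return self.lin(self.input_dp(x))

def split_without_dp(m): return [[m.lin]]                # forgets input_dp
def split_with_dp(m):    return [[m.input_dp], [m.lin]]  # covers every layer

def opt_from_groups(groups, lr=0.1):
    "Build one optimizer param group per layer group."
    params = [{"params": [p for layer in g for p in layer.parameters()]}
              for g in groups]
    return torch.optim.SGD(params, lr=lr)

def scale_after_one_step(split):
    "Run one SGD step and return the resulting value of the scale."
    torch.manual_seed(0)
    model = TinyModel()
    opt = opt_from_groups(split(model))
    model(torch.ones(2, 3)).sum().backward()
    opt.step()
    return model.input_dp.scale.item()

unchanged = scale_after_one_step(split_without_dp)  # scale is never stepped
updated = scale_after_one_step(split_with_dp)       # scale is in the optimizer
```

With the first split the scale stays at its initial 1.5 no matter how long you train; with the second it is stepped along with the other weights, and each group can still get its own learning rate.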

That’s it! If I just don’t pass a split_func to the Learner, it works. Thanks so much. It didn’t occur to me at all that the split function could act as a filter in front of the optimizer.

Yeah, fastai’s pipeline is a bit unnatural, as it adds parameters to the optimizer from layer_groups and not directly from the model. That makes discriminative layer training easy to use, but it requires you to be very careful when splitting or just changing a model (for instance, you can’t just do learner.model = new_model; you have to create a new learner).

Makes sense. I guess it’s something I should know after doing Part 2, but you just forget so much so quickly. Thanks for helping out!

Indeed, and all of this should probably be better documented, if someone has the time to do so and suggest a PR :)

I could try to think about how to document this better, but I haven’t given it any thought yet. And I’ve never looked into how documentation is actually written for fastai, so a PR might take some time to come. But I’ll definitely look into it.