I am reviewing the AWD-LSTM model from the fastai2 module, and it raised a question for me about WeightDropout. From my understanding, it wraps any nn.Module and applies dropout mask(s) to the target weight(s).
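For reference, here is a minimal sketch of what I understand such a wrapper to do (my own simplification using nn.Linear for brevity, not fastai2's actual implementation): keep the raw weight as the trainable parameter, and write a dropped-out copy onto the wrapped module on each forward pass.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveWeightDrop(nn.Module):
    "Minimal weight-dropout sketch; NOT fastai2's implementation."
    def __init__(self, module, weight_p, layer_name='weight'):
        super().__init__()
        self.module, self.weight_p, self.layer_name = module, weight_p, layer_name
        raw_w = getattr(module, layer_name)
        delattr(module, layer_name)  # unregister the original parameter
        self.register_parameter(f'{layer_name}_raw', nn.Parameter(raw_w.data))

    def forward(self, x):
        raw_w = getattr(self, f'{self.layer_name}_raw')
        # write a dropped-out copy onto the wrapped module; gradients flow
        # back through F.dropout to the raw parameter
        setattr(self.module, self.layer_name,
                F.dropout(raw_w, p=self.weight_p, training=self.training))
        return self.module(x)

wd = NaiveWeightDrop(nn.Linear(3, 5), weight_p=0.8)
wd(torch.randn(8, 3)).sum().backward()
print(wd.weight_raw.grad is None)  # False: the raw weight receives a gradient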
On closer inspection, I found that the target weights don't yield any gradients during back-propagation, while the other weights do have gradients computed. This leads me to the question:
In WeightDropout, how are gradients computed and propagated to the target weight(s)?
To illustrate my point, you can run the following code:
import torch
import torch.nn as nn
from fastai2.text.models.awdlstm import WeightDropout
lstm = nn.LSTM(3, 5, batch_first = True)
# target weight: weight_hh_l0
# non-target weight: weight_ih_l0
lstm_dp = WeightDropout(lstm, weight_p = 0.8, layer_names='weight_hh_l0')
test_input = torch.randn(8, 20, 3) # (batch size, seq length, input dim)
test_h = torch.randn(1, 8, 5)
test_c = test_h.data
output, (h, c) = lstm_dp(test_input, (test_h, test_c))
loss = output.sum()
loss.backward()
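As a sanity check (standard PyTorch introspection, nothing fastai2-specific), you can list which parameters actually received a gradient after the backward pass:

for name, param in lstm_dp.named_parameters():
    print(name, param.grad is not None)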
Check out the non-target weights; they have gradients computed:
In [4]: lstm_dp.module.weight_ih_l0.grad
Out[4]:
tensor([[  6.7713,   1.6536,  -6.0267],
        [  6.0718,   9.1356,  -2.9828],
        [ -4.6106,  -6.5487,   8.9050],
        [ -5.4979,  -2.3226,   0.9387],
        [ -3.4466,   3.1723,  -2.4880],
        [  0.5058,  -1.8952,   0.3562],
        [ -1.5476,  -0.1757,   1.5508],
        [  4.5629,   2.8708,   1.5839],
        [  0.2175,   1.9155,  -0.6714],
        [  0.1650,   0.6840,   0.1294],
        [ 19.2782,  18.3211,  -7.3397],
        [-19.1672, -11.2854,  10.4764],
        [  5.6528,  -2.5588,  -1.9340],
        [  1.1073,  10.5333,   0.9745],
        [  2.7394,   0.5985,  -1.1770],
        [  2.1402,  -0.8321,   1.0183],
        [  3.1592,   6.3710,  -3.9283],
        [ -4.2480,  -5.9663,   8.3711],
        [ -2.5984,  -0.1586,   1.3106],
        [ -1.8626,   2.3050,  -1.2497]])
Check out the target weight in both lstm_dp.module.{weight} and lstm_dp.{weight}_raw; neither has a gradient computed:
In [5]: lstm_dp.module.weight_hh_l0.grad
In [6]: lstm_dp.weight_hh_l0_raw.grad
(Neither expression produces any output: both grads are None.)
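For what it's worth, one generic PyTorch detail that may be relevant here (a standalone example, not fastai2 code): .grad is only populated on leaf tensors, so any tensor produced by an operation such as F.dropout reports None for .grad even though the gradient does flow through it to the leaf:

import torch
import torch.nn.functional as F

w_raw = torch.randn(5, 3, requires_grad=True)  # leaf tensor
w = F.dropout(w_raw, p=0.8, training=True)     # non-leaf: output of an op
w.sum().backward()
print(w_raw.grad is None)  # False: the leaf accumulates the gradient
print(w.grad is None)      # True: non-leaf tensors don't populate .grad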