Multilingual ULMFiT

Hi,

I’m trying to put some clues together here and would be really grateful for your advice.

I’ve recently been re-investigating WeightDropout, looking at two things: the duplicated-parameter issue and the use of F.dropout(training=False) as an identity function to initialize the dropped weight. The latter was asked about in a dangling thread (Using F.dropout to copy parameters), and I figure it may trace back to the QRNN discussion here and then the revisions https://github.com/n-waves/fastai/commit/d60adca369f6e548a494109a849ea5ebb1061a61 and https://github.com/fastai/fastai/commit/b842586e9b080ed83afb251d4236ec6843d823de.
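To make the question concrete, here is the kind of probe I run to see what F.dropout(training=False) actually returns on a given PyTorch version, i.e. whether it hands back the same leaf tensor or wraps it in a new non-leaf node (the variable names are just mine):

```python
import torch
import torch.nn.functional as F

w = torch.nn.Parameter(torch.randn(5, 5))

# With training=True, dropout multiplies by a random mask, so the result
# is a new tensor on the autograd graph, hence not a leaf.
dropped = F.dropout(w, p=0.5, training=True)
print(dropped is w, dropped.is_leaf)   # False False

# With training=False, dropout is a no-op; whether it returns w itself
# (still a leaf Parameter) or a fresh graph node is exactly what seems
# to have varied across PyTorch versions.
copied = F.dropout(w, p=0.5, training=False)
print(copied is w, copied.is_leaf, copied.requires_grad)
```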
As for the former issue of the duplicated weights, I’m wondering if we can handle it the way the original Salesforce version does: delete the original weight in __init__() once and then put it back in forward(), so that the gradient is picked up correctly without keeping an extra weight parameter around. This may be related to the frequently changing behavior of Tensor.is_leaf across PyTorch versions, according to the discussion I participated in for AllenNLP’s DropConnect (Add workarounds to avoid _flat_weights issues for DropConnect, #issuecomment-546670214). Perhaps the F.dropout(training=False) call matters because it initializes the copy with Tensor.is_leaf=True, so that the optimizer can add its parameter groups correctly.
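For reference, here is a minimal sketch of that delete-and-restore pattern as I remember it from salesforce/awd-lstm-lm, simplified (the class and variable names are mine, and I demo it on nn.Linear rather than an RNN to sidestep the _flat_weights machinery):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropSketch(nn.Module):
    """Delete the wrapped module's weight once at setup and re-create it
    from a '_raw' copy on every forward, so the only real (leaf)
    Parameter the optimizer ever sees is the raw one."""

    def __init__(self, module, name_w='weight', p=0.5):
        super().__init__()
        self.module, self.name_w, self.p = module, name_w, p
        w = getattr(module, name_w)
        del module._parameters[name_w]  # drop the original Parameter once
        module.register_parameter(name_w + '_raw', nn.Parameter(w.data))

    def forward(self, *args):
        raw_w = getattr(self.module, self.name_w + '_raw')
        # The dropped weight is a plain non-leaf tensor; the gradient
        # flows through F.dropout back into the raw leaf Parameter.
        setattr(self.module, self.name_w,
                F.dropout(raw_w, p=self.p, training=self.training))
        return self.module(*args)

# Quick check that the gradient lands on the raw Parameter:
lin = nn.Linear(4, 4)
wd = WeightDropSketch(lin, name_w='weight', p=0.5)
wd(torch.randn(2, 4)).sum().backward()
print(lin.weight_raw.grad is not None)  # True
```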

Please check my revision for DropConnect and let me know whether it is correct.

Thank you!