Multilingual ULMFiT

Hi,

I’m trying to put some clues together here and would be really grateful for your advice.

I’ve recently been re-investigating WeightDropout, specifically two parameter-related issues: the duplicated weights, and the use of F.dropout(training=False) as an identity function to initialize the dropped weight. The latter was asked about in a dangling thread (Using F.dropout to copy parameters), and I figure it may trace back to the QRNN discussion here and then the revisions in https://github.com/n-waves/fastai/commit/d60adca369f6e548a494109a849ea5ebb1061a61 and https://github.com/fastai/fastai/commit/b842586e9b080ed83afb251d4236ec6843d823de.
As for the former (the duplicated weights), I’m wondering whether we can handle it like the original Salesforce version, which deletes the original weight once in __init__() and then puts it back in forward(), so that the gradient is picked up correctly without keeping an extra copy of the weight. It may have something to do with the behavior of Tensor.is_leaf, which has changed frequently across PyTorch versions, judging from the discussion I took part in for AllenNLP’s DropConnect: Add workarounds to avoid _flat_weights issues for DropConnect #issuecomment-546670214. Perhaps F.dropout(training=False) is there to initialize the copy with Tensor.is_leaf=True so that the optimizer can add its parameter groups correctly.
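For reference, here is a minimal sketch of the Salesforce-style trick I mean, written from memory rather than copied from awd-lstm, and wrapping nn.Linear instead of an RNN to sidestep the _flat_weights issue:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDropSketch(nn.Module):
    """Rough, from-memory sketch of the Salesforce-style WeightDrop."""

    def __init__(self, module, weight_name, p=0.5):
        super().__init__()
        self.module, self.weight_name, self.p = module, weight_name, p
        w = getattr(module, weight_name)
        # Delete the original Parameter once and register only a `_raw` copy,
        # so the wrapped module never carries two Parameters for the same weight.
        del module._parameters[weight_name]
        module.register_parameter(weight_name + '_raw', nn.Parameter(w.data))

    def _setweights(self):
        raw = getattr(self.module, self.weight_name + '_raw')
        # The dropped weight is a plain non-leaf Tensor; its grad flows back to `_raw`.
        setattr(self.module, self.weight_name,
                F.dropout(raw, p=self.p, training=self.training))

    def forward(self, *args):
        self._setweights()
        return self.module(*args)

layer = WeightDropSketch(nn.Linear(4, 4, bias=False), 'weight', p=0.5)
out = layer(torch.randn(2, 4))
print([name for name, _ in layer.named_parameters()])  # ['module.weight_raw']
```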

Please check my revision of DropConnect and let me know whether it is correct.

Thank you!

Please ignore most of it. My apologies for the distraction and the somewhat off-topic reply.

I just realized that as long as we pass only the non-frozen parameters to the optimizer, there’s simply no need to worry about duplicated weights and non-leaf tensors. (Probably because the duplicated weights share the same value in __init__(), so Module._named_members() returns only one of them.)
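For the deduplication part, here is a toy example (made up for illustration, not fastai code) showing what Module._named_members() does when the same tensor is registered under two names:

```python
import torch
import torch.nn as nn

class Dup(nn.Module):
    """Toy module: one Parameter registered under two names."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(3, 3))
        self.weight_raw = self.weight  # same object, second name

m = Dup()
# ._named_members() (behind .parameters()/.named_parameters()) deduplicates by
# tensor identity, so the optimizer only ever receives this weight once.
print([name for name, _ in m.named_parameters()])  # ['weight']
print(len(list(m.parameters())))                   # 1
```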

One slightly unclear point is that the F.dropout(training=False) in __init__() no longer seems to have any effect (perhaps it was there to keep is_leaf=True for old versions of PyTorch?), unless it does some magic that QRNN requires. In other words, I regret not posting this under the seemingly more relevant thread, Using F.dropout to copy parameters.
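For what it’s worth, this is the quick check I would run to see what F.dropout(training=False) actually returns; whether it hands back the same object and keeps is_leaf=True is exactly the part that may differ across PyTorch versions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

w = nn.Parameter(torch.randn(3, 3))
copy = F.dropout(w, p=0.5, training=False)

print(torch.equal(copy, w))  # True: with training=False the values are untouched
print(copy is w)             # version-dependent: same object, or a fresh tensor?
print(copy.is_leaf)          # version-dependent: a leaf, or a node in the graph?
```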

Again, sorry for the hassle.

I don’t remember much, except that it was magically working with this, and not otherwise. It’s very possible that it’s no longer necessary with newer versions of PyTorch.

In terms of parameters, something weird happens: it is registered as a new parameter at first, but after the first iteration of training it disappears. In any case, this is something I’ll try to clean up once v2 is finished.

Thank you very much for the information. I’ve checked the latest two changes to PyTorch’s nn.functional.dropout (https://github.com/pytorch/pytorch/pull/10384 and then https://github.com/pytorch/pytorch/pull/13484) but failed to find any clue :frowning:

To the best of my knowledge, after the parameter registration, that “something weird” may hide in the following steps (probably [1.2] and [3]); there is also a small runnable sketch after this list:

  1. After the weight-dropped layer and the whole model are initialized, someone creates an optimizer with the parameter tensors obtained from:
    1. .parameters() (or .named_parameters()), plus some preprocessing such as filtering by .requires_grad for frozen layers;
    2. ._named_members() underneath, which returns each tensor only once (deduplicated by identity). The outermost module’s parameter takes precedence, or the first one registered wins if the duplicates sit on the same module (but probably no one registers them on the same module anyway).
  2. During the first training iteration:
    1. Before .forward(), the parameters inside the optimizer must all have is_leaf=True, otherwise the optimizer will complain. Luckily, [1.2] takes care of that.
    2. After .forward(), the original slot holds a different, weight-dropped Tensor (no longer wrapped in a Parameter), so anyone calling .parameters() from now on sees an additional entry.
    3. .backward() accumulates grads into the _raw one, but not into the original one, which now has .is_leaf=False.
    4. .step() updates the _raw one’s weights.
  3. If somebody then runs .eval() and .forward() for validation, the weight-dropped Tensor disappears from .parameters() again, because F.dropout(training=False) makes the duplicated entries share the same value once more.
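To make the walkthrough concrete, here is a small self-contained sketch (my own minimal imitation of a fastai-v1-style WeightDropout, not the actual fastai code) that prints how the parameter list changes across these steps; the exact counts may depend on the PyTorch version, which is part of the puzzle:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniWeightDropout(nn.Module):
    """Toy imitation of a fastai-v1-style WeightDropout wrapping nn.Linear."""

    def __init__(self, p=0.5):
        super().__init__()
        self.module, self.weight_p = nn.Linear(4, 4, bias=False), p
        w = self.module.weight
        self.register_parameter('weight_raw', nn.Parameter(w.data))
        # Identity "copy" of the raw weight into the wrapped module's slot ([1]).
        self.module._parameters['weight'] = F.dropout(self.weight_raw, p, training=False)

    def _setweights(self):
        # Replace the wrapped module's weight with a dropped (non-Parameter) Tensor ([2.2]).
        self.module._parameters['weight'] = F.dropout(
            self.weight_raw, self.weight_p, training=self.training)

    def forward(self, x):
        self._setweights()
        return self.module(x)

m = MiniWeightDropout()
print(len(list(m.parameters())))      # before any forward
m.train()
loss = m(torch.randn(2, 4)).sum()
print(len(list(m.parameters())))      # after a training forward: the dropped copy appears ([2.2])
loss.backward()
print(m.weight_raw.grad is not None)  # True: grads accumulate into the _raw one ([2.3])
m.eval(); m(torch.randn(2, 4))
print(len(list(m.parameters())))      # after an eval forward: the copy collapses again ([3])
```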

I’d like to help when the time comes. :slight_smile:

I am interested in using MultiFiT for zero-shot learning. However, I don’t see that code in the repo. Could you please share the code for zero-shot learning?