I would like to contribute for the Bangla language. Can someone give me a head start? Are there any instructions for making the wiki dataset? That would be very helpful. Thanks. I am also looking forward to using SentencePiece.
The contact person listed in the ULMFiT for Bangla seems to have been inactive for over a year. Is there anyone actually working on it?
Also, I found this project in the wild. It has a Bangla Wikipedia corpus; I didn't get the opportunity to check it out, but it might be useful to you.
I'm also trying to find a way to use the Wikipedia data dumps. I'll share the dataset if I manage to put something together.
I haven't found anyone else working on Bangla. I am currently working on it.
I have actually checked out the project you mentioned. The dataset seems small, so I was thinking of building a larger one.
Here are the data dumps: https://archive.org/search.php?query=bnwiki&and[]=year%3A"2019"
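In case it's useful, here is a rough sketch of how one might turn one of those dumps into a plain-text corpus and train a SentencePiece model on it. I'm assuming the wikiextractor and sentencepiece packages; paths, file names, and vocab size are placeholders, not tested on the actual bnwiki dump:

```python
# Sketch: bnwiki dump -> plain-text corpus -> SentencePiece model.
# Assumes `pip install wikiextractor sentencepiece`; names are placeholders.
import glob
import sentencepiece as spm

# 1) Extract article text from the dump (shell command, wikiextractor package):
#    python -m wikiextractor.WikiExtractor bnwiki-pages-articles.xml.bz2 -o extracted

# 2) Concatenate the extracted files into one corpus file,
#    dropping the <doc> wrapper lines wikiextractor emits.
with open('bnwiki_corpus.txt', 'w', encoding='utf-8') as out:
    for path in sorted(glob.glob('extracted/*/wiki_*')):
        with open(path, encoding='utf-8') as f:
            for line in f:
                if not line.startswith('<'):
                    out.write(line)

# 3) Train a SentencePiece model; high character coverage keeps rare
#    Bangla characters in the vocabulary.
spm.SentencePieceTrainer.Train(
    '--input=bnwiki_corpus.txt --model_prefix=bn_spm '
    '--vocab_size=30000 --character_coverage=0.9998 --model_type=unigram'
)
```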
Which platform are you working on?
Both Kaggle kernels and Colab time out even before they finish training on the IMDB example.
I was working on Colab. Where can I find the IMDB dataset you are referring to?
The one in the Lesson 3 video. Colab shows me a 56-hour ETA on training.
This one.
Are you sure you turned on the GPU?
Facepalm
No, I didn't. Thanks.
In my defense, though, I prefer Kaggle over Colab.
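For anyone else who trips over this: after switching the Colab runtime to GPU, a quick sanity check (assuming PyTorch) is:

```python
import torch

# On a GPU runtime this should report the device name;
# if not, switch Runtime > Change runtime type > GPU.
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU visible to PyTorch')
```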
Anyway, I'll try to find the best database files from the dumps and get them into a reasonable file format.
Time to create a ULMFiT-Bangla thread.
I have already completed ULMFiT using other texts instead of Wikipedia. I was wondering about the effect of Wikipedia texts.
Impressive. Then all the more reason we should open a ULMFiT-Bangla thread; please do the honors.
Then you can point me in the right direction. Where do I start? I read somewhere about an Indian language project; could you provide the link?
Sure, I will open a thread soon. I'm not sure about the Indian language project. I've followed Jeremy's classes and didn't use any separate language-specific tokenization.
Right. Please keep me in the loop when you do.
Cheers.
Hello everyone! I am working as a researcher at the Turku University Hospital. We have quite nice GPU resources here, and I have trained a Finnish ULMFiT model on the Finnish Wikipedia using the n-waves scripts. I reached a perplexity of about 23.6. I'll double-check with my employer whether I can open-source the model, the vocabulary, and an example classification done on open data (City of Turku feedback classification with a few specific classes).
What is the best way of sharing the model and related files if I get the green light? I can put them on GitHub, but is there some model zoo or similar where the different models are more easily accessible?
In case anyone is interested, here is a link to a Finnish model trained on Wikipedia with the n-waves scripts; it got a validation perplexity of about 23.8:
A notebook for building a classifier is also included! Maybe that could be helpful for others who would like to use the pretrained models for experiments. At least the n_hid=1150 config thing caused me to lose a few hairs…
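Concretely, the part that tripped me up looks roughly like this in fastai v1 (the file names below are placeholders for whatever is in the repo, and `data_lm` is assumed to be your own language-model DataBunch):

```python
# Sketch: loading externally trained weights into fastai v1. The default
# AWD-LSTM config in fastai uses a different n_hid, so it has to be
# overridden to match the size the pretrained weights were trained with.
from fastai.text import AWD_LSTM, awd_lstm_lm_config, language_model_learner

config = awd_lstm_lm_config.copy()
config['n_hid'] = 1150  # must match the pretrained weights

learn = language_model_learner(
    data_lm, AWD_LSTM, config=config, pretrained=False,
    pretrained_fnames=['fi_lm_weights', 'fi_lm_itos'],  # placeholder names
)
```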
First of all, thanks for making the Dutch language model available! It helps greatly. I was searching for a Dutch dataset that could be used for benchmarking but couldn't find any. For German I came across http://www.spinningbytes.com/resources/. I was wondering if you know of any Dutch datasets?
Hi James! I'm glad to hear the Dutch language model was of use to you.
Do you mean benchmarking the performance of the language model on downstream tasks, e.g. classification? I've created a dataset for this purpose: the 110k Dutch Book Review Dataset (110kDBRD). You should be able to get around 94% accuracy on the out-of-the-box dataset.
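For reference, a minimal fastai v1 sketch of what I mean (paths, batch size, and the encoder name are placeholders; it assumes train/test folders with one subfolder per class, and `lm_vocab` is the vocab of the language model you fine-tuned):

```python
# Sketch: a downstream classifier on 110kDBRD with fastai v1.
# `lm_vocab` must match the language model whose encoder is loaded below.
from fastai.text import TextList, AWD_LSTM, text_classifier_learner

data_clas = (TextList.from_folder('data/110kDBRD', vocab=lm_vocab)
             .split_by_folder(train='train', valid='test')
             .label_from_folder()
             .databunch(bs=64))

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')  # encoder saved from the fine-tuned LM
learn.fit_one_cycle(1, 2e-2)
```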
Thanks, Benjamin, for creating the book review dataset and sharing it; great work! Actually, I was looking for a public dataset that has been used in an academic paper. Anyway, it doesn't matter much. I used the Dutch language model and tried it on a classification task; with around 600 samples per category I am getting close to 90% accuracy.
Hi,
I'm trying to put some clues together here and would be really grateful for your advice.
I've recently been re-investigating `WeightDropout` for two parameter-related issues: the duplicated weight, and the usage of initializing the dropped weight with the identity call `F.dropout(training=False)`. The latter has been asked about in a dangling thread (Using F.dropout to copy parameters), and I figure it may stem from the QRNN discussion here and the subsequent revisions https://github.com/n-waves/fastai/commit/d60adca369f6e548a494109a849ea5ebb1061a61 and https://github.com/fastai/fastai/commit/b842586e9b080ed83afb251d4236ec6843d823de.

For the former issue of the duplicated weights, I'm wondering if we can handle it like the original Salesforce version, which deletes the original weight in `__init__()` once and then puts it back in `forward()`, such that the gradient is picked up correctly without keeping an extra weight layer. It may have something to do with the frequently changing behavior of `Tensor.is_leaf` across different versions of PyTorch, according to the discussion I participated in for AllenNLP's DropConnect: Add workarounds to avoid _flat_weights issues for DropConnect #issuecomment-546670214. Perhaps `F.dropout(training=False)` has something to do with initializing the weight with `Tensor.is_leaf=True` so that the optimizer can add parameter groups correctly.
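To make that concrete, here is a minimal sketch of the Salesforce-style pattern I mean (simplified and untested against fastai; a real implementation also has to deal with cuDNN's `_flat_weights` for RNN modules):

```python
# Minimal sketch of the Salesforce-style trick: drop the original weight
# from the module's parameters once in __init__, keep only a "_raw" leaf
# parameter, and write the dropped copy back as a plain attribute in
# forward() so gradients flow to the raw weight.
import torch.nn as nn
import torch.nn.functional as F

class WeightDrop(nn.Module):
    def __init__(self, module, weight_name='weight_hh_l0', p=0.5):
        super().__init__()
        self.module, self.weight_name, self.p = module, weight_name, p
        raw = getattr(module, weight_name)
        del module._parameters[weight_name]  # no duplicated parameter left behind
        module.register_parameter(weight_name + '_raw', nn.Parameter(raw.data))

    def forward(self, *args):
        raw = getattr(self.module, self.weight_name + '_raw')
        # The dropped weight is a non-leaf tensor, but the gradient is
        # picked up by the "_raw" leaf parameter the optimizer actually sees.
        setattr(self.module, self.weight_name,
                F.dropout(raw, p=self.p, training=self.training))
        return self.module(*args)
```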
Please check my revision for DropConnect and let me know whether it is correct or not.
Thank you!
Please ignore most of it. My apologies for the distraction and the somewhat off-topic reply.
I just realized that as long as we pass non-frozen parameters to the optimizer, there's simply no need to worry about duplicated weights and non-leaf tensors. (Probably because the duplicated weights share the same value in `__init__()`, such that `Module._named_members()` returns only one of them.)
One slightly unclear thing is that `F.dropout(training=False)` in `__init__()` seems to have no effect now (perhaps it was there to keep `is_leaf=True` for old versions of PyTorch?), unless it does some magic that QRNN requires. In other words, I regret that I didn't post this under the seemingly more relevant thread, Using F.dropout to copy parameters.
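For what it's worth, a quick check of the identity behavior on a recent PyTorch (as discussed, older versions may have routed this through an autograd function and returned a non-leaf copy):

```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 3, requires_grad=True)
y = F.dropout(x, p=0.5, training=False)  # identity: no masking, no rescaling
print(torch.equal(x, y))  # True
print(y is x, y.is_leaf)  # on recent versions the input comes back unchanged
```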
Again, sorry for the hassle.
I don't remember much, except that it was magically working with this and not otherwise. It is very possible that it's not necessary with the new version of PyTorch.
In terms of parameters, something weird happens: it is registered as a new parameter at first, but after the first iteration of training, it disappears. In any case, this is something I'll try to clean up once v2 is finished.