Multilingual ULMFiT

@piotr.czapla I noticed that this branch has been active for some time now. Is there a deadline for merging with the fast.ai master branch? Also, I understood that one of the goals of this thread was to use these pretrained models in a conference, which is now past. Is there still a reason to keep this as a branch? Is there some goal we are trying to reach?


We posted the paper and are waiting for the review. I haven’t yet managed to merge all the code and update it to the newest fast.ai, as we had another conference where we were using ULMFiT that we learned about 2 weeks before the deadline, so we had a tough time.


What is the setup required to clone https://github.com/n-waves/ulmfit-multilingual and get it up and running locally?

I cloned the repo, but I get errors when enabling QRNN or just trying to retrain by running the code in the README:

$ LANG=en
$ python -m ulmfit lm --dataset-path data/wiki/${LANG}-100 --tokenizer='f' --nl 3 --name 'orig' --max-vocab 60000 \
    --lang ${LANG} --qrnn=True - train 2 --bs=50 --drop_mult=0.5 --label-smoothing-eps=0.2

Thanks

Hey @piotr.czapla!

Seems we are on the same track. Did you manage to deploy a model trained with ULMFiT using Torch.JIT? If so, it would be great if you could share your experience. :slight_smile:
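In case it helps frame the question, here is roughly the workflow I have in mind — a minimal sketch of torch.jit.trace on a stand-in LSTM language model (the class, names, and sizes below are made up for illustration, not working ULMFiT code; the real AWD-LSTM keeps hidden state and uses custom dropout, so it likely needs extra care, e.g. scripting instead of tracing):

import torch
import torch.nn as nn

# Stand-in language model: embedding -> LSTM -> vocabulary projection.
class TinyLM(nn.Module):
    def __init__(self, vocab_sz=100, emb_sz=32, hid_sz=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)
        self.out = nn.Linear(hid_sz, vocab_sz)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = TinyLM().eval()                          # inference mode: disables dropout
example = torch.zeros(1, 10, dtype=torch.long)   # dummy batch of token ids
traced = torch.jit.trace(model, example)         # records the forward pass on the example
traced.save('lm_traced.pt')                      # loadable later without the Python class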

@piotr.czapla, @sebastianruder

Dumb question: if I understood correctly, this thread is about creating one ULMFiT model per language. Are there any plans to create a single model for multiple languages (like BERT)?

Hi, I’m working on ULMFiT for Croatian. I have gone through this topic together with the Language model zoo topic.

I have one question regarding those two topics: they have the same agenda, right? The agenda being creating and using ULMFiT for other (not yet implemented) languages. Or does “Multilingual” in the title refer to a single language model that can learn multiple languages? (I get a different impression, hence the question.)

For creating the language model, I’ll use https://github.com/n-waves/ulmfit-multilingual and the fastai v1 library.

Hello Marin. In my opinion this thread is more about the former than the latter. Here we attempt to use ULMFiT for different NLP domains in different languages, in most cases with great success. Still, ideas on a model to “rule them all” are more than welcome. :slight_smile:


Hi @piotr.czapla, if I use prepare_wiki.sh from your https://github.com/n-waves/ulmfit-multilingual, it will create the following directories:

  • {LANG}-2 for small wiki with max number of tokens in each train:valid:test = 2000k:200k:200k (10:1:1)
  • {LANG}-100 for large wiki with max number of tokens in each train:valid:test = 98000k:200k:200k (490:1:1)
  • {LANG}-all for all wiki with max number of tokens in each train:valid:test = all:200k:200k (>490:1:1)

Is it on purpose that the number of tokens (in the end, the number of articles) in valid and test is the same for the small, large, and all wikis? In my case, with the Indonesian wiki, I get 446,632 articles in the training set but only 256 articles in the valid or test set for the large wiki. I think the valid and test sets are just too small compared to the training set in the large wiki (and it could be worse in the all wiki).

Hi, I need to know the training time of the LM, because I need to schedule a GPU on a server. I’m training an LM for Croatian.

I mainly follow @duxan’s and @ademyanchuk’s notebooks for training LM models (they have also printed training times), so maybe you two would know most precisely. My GPU is a GeForce GTX 1080 (is this maybe too weak a GPU for this?). What GPUs did you two use (or anybody else who trained LM models for their respective languages)?

Hi.
I trained an LM on a GTX 1080 Ti. It used nearly 7 GB of GPU memory. It took ~26 hours to train to a perplexity of ~27. Hope this information helps and that I correctly understood your question.


Could be, but perplexity is a very poor estimate of accuracy on the downstream task; it is there only to make sure the model is being trained. So I wouldn’t care too much about the validation set: the smaller it is, the faster the training :).

Hello,

I am doing the fastai course (2019), and I am new to this forum. I am interested in applying fastai to NLP in the Dutch language.
I looked at the pretrained models available in ulmfit-multilingual (pretrained_lm_models.zip), but it does not contain Dutch.
Is there a pretrained Dutch model available?

@JoepJ: You could try the language model I trained on a Dutch Wikipedia corpus for a couple of days.

Let me know how it worked out for you and whether you need any help. Good luck!


Thanks @benjaminvdb!
That saves a lot of time and effort :smile:.
I will check it out.

I would like to contribute for the Bangla language. Can someone give me a head start? Are there any instructions for making the wiki dataset? That would be very helpful. Thanks. Also looking forward to using SentencePiece.

The contact person listed in the ULMFiT for Bangla thread seems to have been inactive for over a year. Is there anyone actually working on it?
Also, I found this project in the wild. It has a Bangla Wikipedia corpus; I didn’t get the opportunity to check it out, but it might be useful to you.

I’m also trying to find a way to use Wikipedia data dumps. I’ll share the dataset if I manage to do something.

I haven’t found anyone else working on Bangla; I am currently working on it.
I have actually checked out the project you mentioned. The dataset seems small, so I was thinking of building a larger one.

Here are the data dumps: https://archive.org/search.php?query=bnwiki&and[]=year%3A"2019"
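In case it’s useful, here is a rough sketch of how one of those dumps could be turned into plain text with the wikiextractor package (pip install wikiextractor) — the dump filename, module path, and flags below are assumptions and vary between wikiextractor versions, so treat this as a starting point rather than a recipe:

import subprocess

# Run WikiExtractor over a downloaded dump; it strips the wiki markup and
# writes one JSON object per article into the output directory.
subprocess.run(
    [
        'python', '-m', 'wikiextractor.WikiExtractor',
        'bnwiki-latest-pages-articles.xml.bz2',  # illustrative dump filename
        '--json',                                # JSON output, one article per line
        '-o', 'extracted',                       # output directory for the text
    ],
    check=True,  # raise if extraction fails
)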

Which platform are you working on?
Both Kaggle kernels and Colab time out even before they finish training on the IMDB example.

I was working on Colab. Where can I find the IMDB dataset you are referring to?

The one in the Lesson 3 video. Colab shows me a 56-hour ETA on training.
This one.