Multilingual ULMFiT

(Piotr Czapla) #21

You have good intuition. If not for a bug, the issue would be fixed by simply calling .reset.
Unfortunately QRNNLayer does not propagate reset to its linear layer, which is wrapped by WeightDropout and needs the reset. The issue is not present in LSTM, as it has a properly implemented reset() method.
Here is the problematic line that does not call reset on self.linear, which is later wrapped with WeightDropout

I’m not sure why you don’t have this issue though; tomorrow I’m going to create some unit tests to illustrate the issue and propose a fix.
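To make the propagation idea concrete, here is a minimal, hypothetical sketch with toy stand-ins for fastai’s QRNNLayer and WeightDropout (these stub classes are illustrations only, not the real implementations):

```python
class WeightDropStub:
    """Toy stand-in for the WeightDropout wrapper: it must have reset()
    called so the dropped-out weights get recomputed from the raw ones."""
    def __init__(self):
        self.was_reset = False

    def reset(self):
        # the real wrapper re-applies dropout to its cached raw weights here
        self.was_reset = True


class QRNNLayerStub:
    """Toy stand-in for QRNNLayer whose linear layer is wrapped for dropout."""
    def __init__(self):
        self.linear = WeightDropStub()

    def reset(self):
        # the buggy version omitted this call, so the wrapper kept stale weights
        if hasattr(self.linear, 'reset'):
            self.linear.reset()


layer = QRNNLayerStub()
layer.reset()
print(layer.linear.was_reset)  # True: the wrapped module was reset too
```

The fix is just that one forwarding call in reset(); LSTM already does the equivalent, which is why it is unaffected.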

Besides, I’ve noticed another bug in WeightDropout that makes the underlying module.weights appear twice in the resulting state_dict.


Not sure about the propagation of the reset call, but WeightDropout has two parameters registered to deal with weight dropout: the original ones and the dropped-out ones. This is the only way we managed to do weight dropout for now, so it’s not a bug.
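Why the weights show up twice can be sketched with a toy example (hypothetical names; the real WeightDropout registers an analogous pair of tensors):

```python
class ToyWeightDropout:
    """Toy illustration of why a wrapped weight appears twice in state_dict:
    the layer keeps the raw (pre-dropout) weights AND the dropped-out copy."""
    def __init__(self, weight):
        self._params = {
            'weight_raw': weight,      # original weights, updated by the optimizer
            'module.weight': weight,   # dropped-out view consumed by forward()
        }

    def state_dict(self):
        return dict(self._params)


wd = ToyWeightDropout(weight=[0.1, 0.2])
print(sorted(wd.state_dict()))  # ['module.weight', 'weight_raw']
```

So both keys in the serialized dict are intentional: one is the trainable source of truth, the other is the view actually used in the forward pass.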

(Piotr Czapla) #23

@sgugger The issue does not appear when split_func=lm_split is passed to LanguageLearner, and this is why you couldn’t notice it. I’ve created some tests to illustrate the issue (due to cupy’s dependency on CUDA the tests don’t run on Azure Pipelines). The PR has a fix for the issue so that use of split_func is not required.

(Piotr Czapla) #24

I’m a bit unhappy that I’ve started discussing the qrnn issue in this thread, as it makes it harder for people to contribute.
To fix that, let me summarize where we are for anyone who would like to contribute and doesn’t want to read the whole conversation above:

How to contribute

Important update

I decided to reset n-waves/fastai:ulmfit-multilingual to master, as it was previously based on fastai/fastai:ulmfit_v1 and this was making the PR process hard if we didn’t want to force @sgugger to merge ulmfit_v1 into master as well.
So @aki58, if you fetched n-waves/fastai:ulmfit-multilingual as described in this readme: please reset your branch by following these steps:

Where we are

We have the qrnn fixed. The other things, like BiLM or support for XNLI, are still to do.
But instead of listing those things here, I’ve created a project on GitHub. Let’s give it a try to manage our todo list:

(Piotr Czapla) #25

I’ve updated our repos to yesterday’s refactor of fastai; there were some breaking changes. Please pull n-waves/fastai:ulmfit-multilingual and n-waves/ulmfit-multilingual:master. The pretrain_lm script works; I’m not sure about the others, as we don’t yet have a full end-to-end test.

(Alexey) #26

Hello. Great work! I’d like to contribute and make a language model for Russian. As far as I can understand, to create a model I need to reproduce all the steps from the README ( Correct me please if I misunderstood something.
Another question is whether we can somehow contribute this pretrained model to fastai. I would appreciate any information on this question. I’m not a pro and maybe not a good fit for this work, but I’ll try my best, especially if given some guidance!

(Abu Fadl) #27

I am trying an Arabic ULMFiT model. Also not a pro. I see @piotr.czapla is making several updates and refactors; I hope we get a working set of scripts with a bit more guidance. Right now I made the model but can’t run the classification step using XNLI ( The issue is related to weights, and I posted a related topic.

(Sebastian Ruder) #28

Thanks for the input! At this point, we won’t be able to make the NAACL 2019 deadline. We’ll probably target ACL 2019 and hope to contribute the pretrained models back to fastai before then.

(Kaspar Lund) #29

Hi, is it necessary to clean wikidumps with this function that I have seen in lesson 10?

import html
import re

re1 = re.compile(r'  +')  # collapse runs of two or more spaces

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
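For anyone who wants to check what that lesson-10 cleanup actually does, here is a self-contained restatement with a couple of examples (the re1 pattern collapses runs of spaces; the replacements undo common wiki/IMDb escaping artifacts):

```python
import html
import re

re1 = re.compile(r'  +')  # two or more spaces

def fixup(x):
    # undo the most common escaping artifacts before tokenization
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-')
    return re1.sub(' ', html.unescape(x))

print(fixup('don#39;t  stop'))    # → don't stop
print(fixup('line one<br />two')) # → line one, newline, two
```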

(Abu Fadl) #30

Update: managed to train the LM using the most recent fixes from @piotr.czapla. Issue 20 closed.

(Alexey) #31

Thanks. Will try to recreate all the steps for the Russian language.

(Shun Tsuruno) #32

I’ve pretrained a language model for Japanese at our company and would like to contribute it to the model zoo.
However, the official zoo is not open yet and I haven’t been able to figure out what to do.
@piotr.czapla Can you kindly point me to what I should do next?

Here’s what I’ve done so far:

  1. Cloned @piotr.czapla’s ulmfit-multilingual repo (
  2. Created a local branch
  3. Refactored the code to use sentencepiece tokenizer instead of Moses tokenizer (on Japanese Wikipedia) before lm pretraining
  4. Pretrained the language model on Japanese Wikipedia
  5. Fine-tuned + classified MedWeb (medical tweets) and Aozora-bunko (license-free books) datasets

(Abu Fadl) #33

For Arabic, I got 24.7299 perplexity on small wiki corpus using most recent scripts (with some minor adjustments). Used 30k vocab. Will see if I can do better. Trained with both qrnn and bidir off.
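For context on that number: perplexity is just the exponential of the average per-token cross-entropy loss (in nats), so converting between the two is one line (a minimal sketch):

```python
import math

def perplexity(avg_nll):
    """Perplexity from the average negative log-likelihood per token (nats)."""
    return math.exp(avg_nll)

# a reported perplexity of ~24.73 corresponds to a validation loss of ~3.208
loss = math.log(24.7299)
print(round(loss, 3))               # 3.208
print(round(perplexity(loss), 4))   # 24.7299
```

This also makes clear why perplexities are only comparable for the same vocabulary and tokenization: the per-token loss changes when the tokens do.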

(Piotr Czapla) #34

@AbuFadl @ademyanchuk, @s.tsuruno, @kasper, I’m very happy you are interested in contributing.

The cleanup is done by the fastai preprocessing scripts, so you don’t have to do it manually. I think it helps with accuracy, but I’m checking that assumption.

Start a Russian thread in the Language Model Zoo. I think you can start using ulmfit-multilingual now for your tests; once you have something, let me know and we will think about how to get it incorporated into the repo.

@s.tsuruno That is awesome! Can you share your results in the Language Model Zoo? Have you trained it using the old or new library? To contribute the language model, we need to create a set of scripts that let us reproduce your results, so that Jeremy can later make updates to the tokenization mechanism or the models and still be able to retrain them. I haven’t yet thought about how to organize the ulmfit-multilingual repo so that this is possible; I’m open to suggestions. But roughly the process should look as follows:

  • create scripts to download validation datasets (aozora-bunko & medweb)
  • create shell scripts to train LM that has all the hyperparameters you have used (we would have to adapt the current sentence piece tokenization)
  • create validation scripts to check if the pre-trained models are working fine.
  • create shell scripts to train cls on medweb and aozora-bunko
  • share the pre-trained models on google drive, until we manage to move them to the,

(Piotr Czapla) #35

An update on the progress of work on ulmfit-multilingual

Getting everything in shape is taking us a bit longer than anticipated, but we are getting there. To test our work, I wanted to make a set of scripts that let you train your own version of ULMFiT, from WikiText-103 pretraining to IMDB classification, and get the same accuracy as with Jeremy’s scripts.
Even though we still have some issues, if you are eager and happy to do some testing and bug fixing, you can start experimenting now. The most recent version is in the refactoring branch.

The classification script is working fine for two tokenization methods: fastai (‘f’) and Moses + fastai preprocessing (‘vf’). I’m testing pure Moses (‘v’) right now. There are some issues with pretraining a language model, but I’m hoping it is just a matter of training time (I’ve trained only for 10 epochs; now I’m testing 20), and we still have a remaining issue with the sentencepiece implementation (it needs a bit of testing and love).

The new API is almost done, and it lets you run the experiments from the command line or Jupyter notebooks. The experiment folder has two example notebooks.
To run a training from the command line:

$ python -m ulmfit lm --dataset-path data/wiki/wikitext-103 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --bs=40 --cuda-id=0  -  train 20 --drop-mult=0.9
Model dir: data/wiki/wikitext-103/models/f60k/lstm_bs40.m
$ python -m ulmfit cls --dataset-path data/imdb --base-lm-path data/wiki/wikitext-103/models/f60k/lstm_bs40.m - train 20   

As I said, this is still work in progress, hence it is in the refactoring branch, but if you are happy to do some debugging and testing, feel free to start using it now.

(Piotr Czapla) #36

It makes sense; qrnn is faster, but it seems harder to train and we need to do some hyperparameter tuning. Bidir needs lots of RAM at the moment. I’m testing it on English; once I get good results I’ll let you know. Re perplexity: it doesn’t tell you much about accuracy on downstream tasks. Can you try to find some good datasets to run classification on, with previous results to compare against?

(Abu Fadl) #37

Thanks @piotr.czapla. Actually qrnn trained on the small corpus, but fastai couldn’t read the weights (tested 2 days ago: the 1.decoder.bias and 0.encoder.weight). Bidir failed with an error about mismatched batch sizes: ValueError: Expected input batch_size (182406080) to match target batch_size (12160).
I am working on cleaning my notebook and will post a link later. Also, next is work on benchmarking, at least XNLI.

(Abu Fadl) #38

Pleased to post my notebook for building an Arabic language model from a Wikipedia dump (small corpus, limited by Colab GPU memory). Hope it helps others, and sorry for the poor coding practices :slight_smile:

(Abu Fadl) #39

I tried to run the code on XNLI data (Arabic). The accuracy was rather low, ~60%. Are there any results from XNLI (other than English)?
Note: The relative path causes an error in fastai when looking for the pretrained filenames (I copied model files to those deep locations as a workaround), and labels need to be converted to int for this to work. Also getting buffer truncation, and the kernel dies when doing a 100% test.

(Piotr Czapla) #40

I’m testing IMDB, and have issues with the way we train language models. Once I have this fixed, I’ll let you know.

Note: The relative path causes error in fastai’s when looking for the pretrained filenames (copied model files to those deep locations as work-around)

This is the behaviour of the “/” operator on paths. Fastai assumes the model is in a specific location next to the data; to work around this you can make the paths absolute, so model_dir/filepath becomes simply filepath (if filepath is absolute).
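This is standard pathlib behaviour, worth knowing when placing pretrained files: joining with an absolute right-hand operand discards the left-hand directory entirely. A minimal demonstration (the paths here are made up for illustration):

```python
from pathlib import PurePosixPath

model_dir = PurePosixPath('data/imdb/models')

# relative name: '/' appends it under model_dir
print(model_dir / 'lstm_bs40.m')          # data/imdb/models/lstm_bs40.m

# absolute path: the left operand is dropped, the result is just the path
print(model_dir / '/tmp/lm/lstm_bs40.m')  # /tmp/lm/lstm_bs40.m
```

So passing an absolute path as the "filename" effectively overrides whatever base directory fastai prepends.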

and labels need to be converted to int for this to work.

I don’t work with XNLI yet, as the training on IMDB is flawed. Propose a PR if you think there is a bug.

Also getting buffer truncation and kernel dies when doing 100% test.

I didn’t get what you mean here.