Multilingual ULMFiT

I am trying to build an Arabic ULMFiT model (also not a pro). I see @piotr.czapla is making several updates and refactors, and I hope we get a working set of scripts with a bit more guidance. Right now I have built the model but can't run the classification step on XNLI (train_clas.py). The issue is related to weights, and I posted a related topic.

Thanks for the input! At this point, we won’t be able to make the NAACL 2019 deadline. We’ll probably target ACL 2019 and hope to contribute the pretrained models back to fastai before then.

Hi, is it necessary to clean the wiki dumps with this function that I have seen in session 10?

import html
import re

re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>', 'u_n').replace(' @.@ ', '.').replace(
        ' @-@ ', '-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))
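
For reference, a quick check of what it does on a made-up line (not from a real dump):

raw = "I won #36;100 amp; a prize quot;today"
print(fixup(raw))  # -> "I won $100 & a prize 'today"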

Update: managed to train the LM using the most recent fixes from @piotr.czapla. Issue 20 closed.


Thanks. Will try to recreate all the steps for the Russian language.

I’ve pretrained a language model for Japanese at our company and would like to contribute it to the model zoo.
However, the official zoo is not open yet and I haven’t been able to figure out what to do.
@piotr.czapla Can you kindly point me to what I should do next?

Here’s what I’ve done so far:

  1. Cloned @piotr.czapla’s ulmfit-multilingual repo (https://github.com/n-waves/ulmfit-multilingual/projects/1)
  2. Created a local branch
  3. Refactored the code to use sentencepiece tokenizer instead of Moses tokenizer (on Japanese Wikipedia) before lm pretraining
  4. Pretrained the language model on Japanese Wikipedia
  5. Fine-tuned + classified MedWeb (medical tweets) and Aozora-bunko (license-free books) datasets

For Arabic, I got 24.7299 perplexity on a small wiki corpus using the most recent scripts (with some minor adjustments). I used a 30k vocab. Will see if I can do better. Trained with both qrnn and bidir off.
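
For anyone comparing numbers: perplexity is just the exponential of the average cross-entropy loss per token, so 24.73 corresponds to a validation loss of roughly 3.21:

import math

valid_loss = 3.208           # loss implied by a perplexity of ~24.73
print(math.exp(valid_loss))  # ~24.73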


@AbuFadl @ademyanchuk, @s.tsuruno, @kasper, I’m very happy you are interested in contributing.

This is being done by the fastai preprocessing scripts, so you don't have to do it manually. I think it helps with accuracy, but I'm checking that assumption.

Please start a Russian thread in the Language Model Zoo. I think you can start using ulmfit-multilingual for your tests now; once you have something, let me know and we will think about how to get it incorporated into the repo.

@s.tsuruno That is awesome! Can you share your results in the Language Model Zoo thread? Have you trained it using the old or the new library? To contribute the language model, we need to create a set of scripts that let us reproduce your results, so that Jeremy can later make updates to the tokenization mechanism or the models and still be able to retrain them. I haven't thought yet about how to organize the ulmfit-multilingual repo so that this is possible, so I'm open to suggestions. But roughly the process should look as follows (see the sketch after the list):

  • create scripts to download the validation datasets (Aozora-bunko & MedWeb)
  • create shell scripts to train the LM with all the hyperparameters you used (we would have to adapt the current SentencePiece tokenization)
  • create validation scripts to check that the pretrained models are working fine
  • create shell scripts to train the classifier on MedWeb and Aozora-bunko
  • share the pretrained models on Google Drive until we manage to move them to nlp.fast.ai
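
Something like the following, just as a rough sketch (script names, paths and hyperparameters below are placeholders; only the python -m ulmfit pattern is the one used later in this thread):

$ bash prepare_medweb.sh data/medweb        # hypothetical download/prepare script
$ bash prepare_aozora.sh data/aozora        # hypothetical download/prepare script
$ python -m ulmfit lm --dataset-path data/wiki/ja-all --qrnn=False --bidir=False --name 'ja-sp' --bs=64 --cuda-id=0 - train 10
$ python -m ulmfit cls --dataset-path data/medweb --base-lm-path data/wiki/ja-all/models/sp30k/lstm_ja-sp.m - train 20
# then compare the accuracy against your published numbers and upload the *.pth files (e.g. to Google Drive)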

An update on the progress of work on ulmfit-multilingual:

Getting everything in shape is taking us a bit longer than anticipated, but we are getting there. To test our work I wanted to make a set of scripts that let you train your own version of ULMFiT from WikiText-103 to IMDB classification and get the same accuracy as with Jeremy's scripts.
Even though we still have some issues, if you are eager and happy to do some testing and bug fixing, you can start experimenting now. The most recent version is in the refactoring branch.

The classification script is working fine for two tokenization methods: fastai ('f') and Moses + fastai preprocessing ('vf'). I'm testing pure Moses ('v') right now. There are some issues with pretraining a language model, but I'm hoping it is just a matter of training time (I've trained only for 10 epochs; now I'm testing 20), and we still have a remaining issue with the SentencePiece implementation (it needs a bit of testing and love).

The new API is almost done and it lets you run the experiments from the command line or Jupyter notebooks. The experiment folder has two example notebooks.
To run training from the command line:

$ python -m ulmfit lm --dataset-path data/wiki/wikitext-103 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --bs=40 --cuda-id=0  -  train 20 --drop-mult=0.9
Model dir: data/wiki/wikitext-103/models/f60k/lstm_bs40.m
...
$ python -m ulmfit cls --dataset-path data/imdb --base-lm-path data/wiki/wikitext-103/models/f60k/lstm_bs40.m - train 20   
...

As I said, this is still work in progress, hence it is in the refactoring branch, but if you are happy to do some debugging and testing, feel free to start using it now.


It makes sense: qrnn is faster, but it seems harder to train and we need to do some hyperparameter tuning. Bidir needs lots of RAM at the moment. I'm testing it on English; once I get good results I'll let you know. Re perplexity: it doesn't tell you much about the accuracy on downstream tasks. Can you try to find some good datasets with previous results to run classification against?

Thanks @piotr.czapla. Actually, qrnn trained on the small corpus, but fastai couldn't read the weights (tested 2 days ago; the 1.decoder.bias and 0.encoder.weight keys). Bidir failed with an error about mismatched batch sizes: ValueError: Expected input batch_size (182406080) to match target batch_size (12160).
I am working on cleaning up my notebook and will post a link later. Also, next is work on benchmarking, at least XNLI.

Pleased to post my notebook for building an Arabic language model from the Wikipedia dump (small corpus, limited by Colab GPU memory). Hope it helps others, and sorry for the poor coding practices :slight_smile:

I tried to run the code on the XNLI data (Arabic). The accuracy was rather low, ~60%. Are there any results for XNLI other than English?
Note: The relative path causes an error in fastai's learn.py when looking for the pretrained filenames (I copied the model files to those deep locations as a workaround), and the labels need to be converted to int for this to work. I am also getting buffer truncation and the kernel dies when testing on 100% of the data.

I'm testing IMDB and have issues with the way we train language models. Once I have this fixed I'll let you know.

Note: The relative path causes an error in fastai's learn.py when looking for the pretrained filenames (I copied the model files to those deep locations as a workaround)

This is the behaviour of the "/" operator on paths. Fastai assumes the model is in a specific location next to the data; to work around this, you can make the paths absolute, so that model_dir/filepath becomes simply filepath (if filepath is absolute).
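
A quick illustration with pathlib (the file names here are made up):

from pathlib import Path

model_dir = Path('data/imdb/models')
print(model_dir / 'lm_best.pth')                    # data/imdb/models/lm_best.pth
print(model_dir / '/work/wiki/models/lm_best.pth')  # /work/wiki/models/lm_best.pth
# joining with an absolute path on the right discards model_dir entirely,
# which is why passing an absolute filepath sidesteps fastai's default location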

and labels need to be converted to int for this to work.

I haven't worked with XNLI yet, as the training on IMDB is flawed. Propose a PR if you think there is a bug.

I am also getting buffer truncation and the kernel dies when testing on 100% of the data.

I didn't get what you mean here.


@piotr.czapla
Thank you for the quick and helpful reply.
I installed the fastai library from the master branch of the fastai GitHub repo. The version must be around v1.0.28.
I guess I can create your suggested scripts within a week or so.
Will let you know when I’m done.

@piotr.czapla I have finally trained a French LM using SentencePiece and fastai 1.0.37.dev0, using code snippets from you, @sgugger and @tomsthom.
The results are very preliminary: at epoch 10, train loss 3.117836, valid loss 3.239415, accuracy 0.366795.
There are still issues with the control tokens in fastai vs SentencePiece.

Creating the databunch, I had to scale down the number of sentences due to the huge memory consumption in TextLMDataBunch.from_csv / TextLMDataBunch.from_df etc. This could be reduced, but I wonder what the status is of the merge of fastai with the version at "n-waves/ulmfit-multilingual".

For XNLI, I am trying to take the language-filtered rows from xnli.dev.tsv (apparently meant for training) and the corresponding test set, and check the model against gold_label. The multinli.train.[lang].tsv files are for the translation baseline, so if I am not mistaken, the train_clas code (XNLI) needs adjustments.
Buffer truncation is just a memory limitation error on Colab.
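
A rough sketch of that filtering step (the column names follow the released XNLI tsv files; the paths and choice of columns are assumptions):

import pandas as pd

dev  = pd.read_csv('XNLI-1.0/xnli.dev.tsv',  sep='\t', quoting=3)
test = pd.read_csv('XNLI-1.0/xnli.test.tsv', sep='\t', quoting=3)

# keep only the Arabic rows and the columns needed for classification
cols    = ['sentence1', 'sentence2', 'gold_label']
dev_ar  = dev[dev.language == 'ar'][cols]    # used as the training split here
test_ar = test[test.language == 'ar'][cols]  # evaluated against gold_label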

@piotr.czapla, @eisenjulian
Hi, how are you handling the tokenization/encoding with SentencePiece (SP)?
Background:
I made a run with the following arguments for SP:
special_cases = [
    text.transform.TK_MAJ,
    text.transform.TK_UP,
    text.transform.TK_REP,
    text.transform.TK_WREP,
    text.transform.FLD,
]
sp_params = (
    f"--input={pathSrc_list} "
    f"--eos_id=-1 "
    f"--control_symbols={str_specialcases} "
    f"--character_coverage=1.0 "
    f"--model_prefix={model_prefix} "
    f"--vocab_size={self.vocab_size} "
    f"--model_type={self.model_type}"
)

The idea was to reserve ids for the special symbols. However, this does not work, because fastai inserts BOS and FLD in _join_texts in the TokenizeProcessor before the tokenization, and SentencePiece ignores control symbols in the input text (to prevent the user from manipulating the tokenizer), i.e. BOS and FLD get encoded to something like x x b s and x x fld. Encoding BOS and FLD in this way will confuse rather than help a classifier.

In order to preserve the symbols, I am currently making a new run with:
special_cases = [
    text.transform.BOS,
    text.transform.PAD,
    text.transform.TK_MAJ,
    text.transform.TK_UP,
    text.transform.TK_REP,
    text.transform.TK_WREP,
    text.transform.FLD,
]
sp_params = (
    f"--input={pathSrc_list} "
    f"--bos_id=-1 "
    f"--eos_id=-1 "
    f"--pad_id=-1 "
    f"--user_defined_symbols={str_specialcases} "
    f"--character_coverage=1.0 "
    f"--model_prefix={model_prefix} "
    f"--vocab_size={self.vocab_size} "
    f"--model_type={self.model_type}"
)

This seems to work, because now the tokenized cell starts with: ▁ xxbos ▁ xxfld ▁1 ▁entre ▁1945 ▁et ▁1948,
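
A standalone way to check that behaviour outside fastai (the corpus path, vocab size and exact special-token list are placeholders):

import sentencepiece as spm

# user_defined_symbols are matched in raw text, control_symbols are not,
# which is why xxbos/xxfld only survive with the second set of flags above
spm.SentencePieceTrainer.Train(
    "--input=wiki_corpus.txt "
    "--model_prefix=sp_test "
    "--vocab_size=30000 "
    "--bos_id=-1 --eos_id=-1 --pad_id=-1 "
    "--user_defined_symbols=xxbos,xxfld,xxmaj,xxup,xxrep,xxwrep "
    "--character_coverage=1.0"
)

sp = spm.SentencePieceProcessor()
sp.Load("sp_test.model")
print(sp.EncodeAsPieces("xxbos xxfld 1 entre 1945 et 1948"))
# expected: ['▁', 'xxbos', '▁', 'xxfld', '▁1', '▁entre', ...]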

Also, what is the purpose of inserting BOS and FLD in _join_texts? Did your competition go well even with this confusing tokenization of BOS and FLD as above?

Is it possible to resume language model training (from wiki tokens) if it stops after epoch n, where n < num_epochs? I am training a language model and get a buffer truncation error after which the kernel stops, but there are cls-history.csv, lm_1.pth and lm_2.pth in the model folder (it died in the 3rd epoch).

It is awesome to hear that you managed to get a French language model. Have you used Sylvain's code to split Wikipedia by articles for training? It is super important: without it the language model trains to some low perplexity but fails on the downstream task. This small bug was causing all the trouble with training classification on IMDB to high values (without the fix I was getting from ~80% up to 92% accuracy; with the fix I'm getting 94.5%).
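
As I understand it, the important part is that whole articles stay together when the dump is split into training and validation text; a minimal sketch of that idea (not Sylvain's actual code):

import random

def split_by_article(articles, valid_pct=0.1, seed=42):
    # shuffle whole articles, then cut, so no article contributes text
    # to both the training and the validation split
    articles = list(articles)
    random.Random(seed).shuffle(articles)
    cut = int(len(articles) * valid_pct)
    return articles[cut:], articles[:cut]  # train, valid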

The refactoring branch is compatible with the latest fastai

I haven't played with SentencePiece yet, and the code is a bit outdated, but I see your point. Maybe you can create an issue in the ulmfit-multilingual repo and we can discuss it there?
Your approach seems to make sense, so let's get it incorporated.

Kaspar, for the competition we used the old fastai, which didn't have so many layers of abstraction and gave us greater control over the tokenization; we didn't add "xxfld 1" as it would break the perplexity calculation. These fields are inserted by fastai. I haven't removed them yet, as I was first focusing on getting good accuracy, and I figured that text added to every training example won't cause issues. But I'm intending to get more control over the tokens that are inserted into the text, to clean this up and make it more standardized.

I do that from a Jupyter notebook, where I can experiment with the learning rate, but I don't have an example at hand. Simply create an LMHyperParams object with all the parameters you put on the command line, then create the dataset and learn objects, run learn.load("lm_2") and then learn.lr_find() …
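
Something along these lines; a minimal fastai-level sketch, where the databunch construction is a placeholder (in ulmfit-multilingual the LMHyperParams object builds the dataset and learner for you, as in the example notebooks):

from fastai.text import TextLMDataBunch, language_model_learner

# rebuild the same databunch and learner you trained with, then resume
data_lm = TextLMDataBunch.load('data/wiki/ar-all', 'lm_cache')  # placeholder path/cache name
learn = language_model_learner(data_lm, drop_mult=0.9)
learn.load('lm_2')             # weights saved after the 2nd epoch
learn.lr_find()                # sanity-check the learning rate first
learn.fit_one_cycle(1, 2e-3)   # then train the remaining epoch(s)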