Multilingual ULMFiT

piotr.czapla · January 12, 2019, 9:09pm

Yep I think the potion was removed, in favour of bidir training. but feel free to add it.

it was URLs.WT103_1

It is really good to understand every step of your pipeline (i’ve learned it hard way), but weights are often published without exact steps to reproduce them. Think, do you know how pytorch resnet weights are reproduced?

Why not, I’d like to have a way to retrain them if necessary.

I haven’t try it but it should work. Try it. Comparing it with training on a merged Mexican-Spanish and Spanish wikipedia is quite interesting.

piotr.czapla · January 12, 2019, 9:24pm

looks interesting to speed up training and to get better accuracy, I would love to see it in action

I haven’t got you. You need to get that joined in LanguageModelLoader to create batches.

I would be interested to see how this helps

That goes against the idea of quick finetuning.

That was on a server that was destroyed and I don’t have the weights. But I have some pretrained models with weights but I need to check how good they are.

pjetro · January 12, 2019, 9:35pm

@piotr.czapla thanks for your feedback

Well, I think concatenating all documents is not really necessary, just easier (and more efficient computationally).

I would imagine taking a bucket of similar-length sequences, bptt-ing through them from their beginning to end (using PackedSequences to avoid padding), and then taking a next bucket of sequences.

Well, what I meant was:

take the general LM (trained on wikitext)
fine-tune the LM on the downstream task’s dataset
add the classifier head to the LM, without removing the old one, train both further.

The hypothesis: on a small dataset of short text, fine-tuning the classifier might make the encoder “forget” the language. It also might not, and I guess it is not that easy to test.

gauravgulati8 · January 14, 2019, 7:51am

I am using pip for installing & all my packages are as suggested in this page.
fastai==0.7
torch==0.3.1
torchtext==0.2.3
torchvision==0.2.1

Please help me to resolve the issue.

Regards,
Gaurav

ademyanchuk · January 15, 2019, 6:07am

Hello guys) First of all, I woud like to thank you for all the hard work you are doing to make it easier for us following behind.
To @piotr.czapla
I’m working on training russian language model and I have a couple of questions.
My lm training steps are:
exp = LMHyperParams(dataset_path='data/wiki/ru-100/', qrnn=False, tokenizer='v', lang='ru', name='russian')
learn = exp.create_lm_learner(data_lm=data_lm)
learn.unfreeze()
learn.fit_one_cycle(20, 1e-3, moms=(0.8,0.7))

I would like to test a model on the upstream task (text classification) and use functionality of ulmfit.train_clas. Speaking of questions:

How could I use trained lm in classification task? My suggestion is something like this:
exp = CLSHyperParams('data/my_class_task_data')
exp.pretrained_fnames = path.to_my_best_model
Is my suggestion right or do i miss something?
If I use other than imdb task for classification and wish to apply CLSHyperParams, do I need to recreate my_task data directory structure the same way as imdb data structure?
What is a good choice of vocabulary size to train LM from wiki from scratch?

Any advices would be greatly apreciated))

Grigor · January 16, 2019, 9:46am

Hey. My dataset is a mixture of French and English and I have a classification problem. Can you give me some advice on using Ulmfit? Should I train a new LM on mixed French and English wiki? Thanks

benjaminvdb · January 16, 2019, 3:47pm

Hi Grigor!

Is the text within a document in your dataset multi-language or is each document in only one of multiple languages? If the latter is the case, you can use something like the langdetect Python library. Run detect_language() to find the probabilities for each language and remove the document if the probability for your language is below a given threshold.

In my case I have a dataset of book reviews that are mostly in Dutch. Some reviewers have translated the Dutch review into other languages, so I use langdetect to remove these multi-language documents from the dataset.

I hope this helps!

Grigor · January 18, 2019, 10:43am

Hi, thanks for the response.
I tried langdetect but as my dataset consists from descriptions which are not very long, it found ‘wrong’ languages with very high probability.
So having the fact that I cant detect language, I am thinking of training a new model from scratch based on French and English wiki and would like to get an advise.
Thanks

piotr.czapla · January 27, 2019, 6:14pm

you would have to fill all the hyper params then. It is easier to train the LM with LMHyperParams then load it using CLSHyperParams.from_lm(dataset_path, base_lm_path). So in your case:

exp = LMHyperParams(dataset_path='data/wiki/ru-100/', qrnn=False, tokenizer='v', lang='ru', name='russian')

exp = CLSHyperParams.from_lm('data/my_class_task_data', 'data/wiki/ru-100/models/v70k/lstm_russian.m', name='russian') # get the exact name of your model from execution of LMHyperParams

This part isn’t ready yet you would have to modify this file: https://github.com/n-waves/ulmfit-multilingual/blob/master/ulmfit/train_clas.py#L102-L107

Can you suggest something and make a PR, I would love to incorporate Russion into the ulmfit-multilingual.
I would based it on the load_cls_data_imdb.

Russian is a bit like Polish meaning that both languages have rich morphology so I would use sentencepiece tokenizer with 25k tokens.

Here is how I would train russian LM:

python -m ulmfit lm --dataset-path data/wiki/ru-100 --bidir=False --qrnn=False --tokenizer=sp --name 'bs40'  -  train 20 ---bs=40

piotr.czapla · January 27, 2019, 6:20pm

Having Dutch in ulmfit-multilingual would be awesome is your dataset public?

If you may want to use a merged wikipedia’s, to train this kind of LM. @eisenjulian is working on very similar task maybe he has some code to share.

You may as well think about using LASER it was released 5 days ago. GitHub - facebookresearch/LASER: Language-Agnostic SEntence Representations

ademyanchuk · January 28, 2019, 3:25am

Thank you @piotr.czapla. As for now i’m working on the level of fastai abstraction to make it more simple (for me) to understand all the logic. I’m trying to work on upstream classification tasks and get near the best results (in comparison with current benchmark - http://text-machine.cs.uml.edu/projects/rusentiment/).
Next I’ll try to increase the amount of domainspecific (twiter) data for LM finetuning to see if it helps on upstream classification task.
Later I’ll experiment with different tokenizers and try to addapt train_clas.py to my data.

piotr.czapla · January 28, 2019, 5:28pm

@ademyanchuk Perfect ! Can you create a “ULMFiT - Russian” page and put it on the top page of Language Model Zoo? @ppleskov was working on the Russian model in the past but I don’t think he managed to beat the SOTA Russian Sentiment Analysies, He was using the following benchmark: http://www.dialog-21.ru/evaluation/2016/sentiment/

ademyanchuk · January 29, 2019, 3:24am

Shure. I’m not currently best on this one

But i actuallty did some experiments on

and it turned out to be easy task to classify, i get near .98 f1. Check my solution twice on any bugs, but cant find something. So I believe ruSentEval is beaten.

I create thread for ULMFit - Russian ULMFiT - Russian

tblock · January 31, 2019, 11:00am

I just posted about a German topic classification dataset to the ‘ULMFiT German’-page. The dataset might be useful for some of you here.

benjaminvdb · February 1, 2019, 7:47am

I’m very happy to share the Dutch dataset and the weights of the trained language model!

However, for the former, I’m not sure whether I can publish it, since the contents are scraped. I’ve read the website’s disclaimer and there’s nothing in it that forbids it, but I’m still not quite sure about legal issues

For language model weights, do you have pointers how I can best package and describe it? I’ve never shared network weights before.

jeremy · February 1, 2019, 5:37pm

I believe that in most jurisdictions scraping is allowed, unless it is explicitly restricted by the TOS of the site.

benjaminvdb · February 4, 2019, 9:15am

@jeremy That sounds promising!

I still wanted to give credits to the owners of the website and its reviewers, so I’ve sent them an email to see if they’re open for this kind of publication. I’ve already tidied up my code and the dataset itself, so it’s ready for publication. I’m just waiting for a response from them now… Fingers crossed!

sgugger · February 7, 2019, 9:10pm

Please take note of this breaking change coming in v1.0.43

benjaminvdb · February 8, 2019, 12:38pm

@sgugger: thanks for the heads-up!

@piotr.czapla: considering recent developments in the FastAI library, what’s the status of ulmfit-multilingual? I’ve used slightly edited versions of those scripts to generate a language model for Dutch. Is this still the way to go?

piotr.czapla · February 12, 2019, 2:52pm

@benjaminvdb, Good work, and on time so we can add it to the paper that summarise multilingual ULMFiT.

Recently @mkardas made ulmfit-multlingual compatible with the fastai, we have it in a separate branch that I’m looking in to merging to master tomorrow.