Multilingual ULMFiT

You would have to fill in all the hyperparameters then. It is easier to train the LM with LMHyperParams and then load it using CLSHyperParams.from_lm(dataset_path, base_lm_path). So in your case:

exp = LMHyperParams(dataset_path='data/wiki/ru-100/', qrnn=False, tokenizer='v', lang='ru', name='russian')
exp = CLSHyperParams.from_lm('data/my_class_task_data', 'data/wiki/ru-100/models/v70k/lstm_russian.m', name='russian') # get the exact name of your model from execution of LMHyperParams

This part isn’t ready yet; you would have to modify this file: https://github.com/n-waves/ulmfit-multilingual/blob/master/ulmfit/train_clas.py#L102-L107

Can you suggest something and make a PR? I would love to incorporate Russian into ulmfit-multilingual.
I would base it on load_cls_data_imdb.

Russian is a bit like Polish in that both languages have rich morphology, so I would use the sentencepiece tokenizer with 25k tokens.

Here is how I would train a Russian LM:

python -m ulmfit lm --dataset-path data/wiki/ru-100 --bidir=False --qrnn=False --tokenizer=sp --name 'bs40' - train 20 --bs=40

Having Dutch in ulmfit-multilingual would be awesome. Is your dataset public?

You may want to use merged Wikipedias to train this kind of LM. @eisenjulian is working on a very similar task; maybe he has some code to share.

You may also want to think about using LASER; it was released 5 days ago. https://github.com/facebookresearch/LASER

Thank you @piotr.czapla. For now I’m working at the level of fastai abstractions to make it simpler (for me) to understand all the logic. I’m trying to work on the downstream classification task and get near the best results (in comparison with the current benchmark: http://text-machine.cs.uml.edu/projects/rusentiment/).
Next I’ll try to increase the amount of domain-specific (Twitter) data for LM fine-tuning to see if it helps on the downstream classification task.
Later I’ll experiment with different tokenizers and try to adapt train_clas.py to my data.

@ademyanchuk Perfect! Can you create a “ULMFiT - Russian” page and put it on the top page of the Language Model Zoo? @ppleskov was working on a Russian model in the past, but I don’t think he managed to beat the SOTA in Russian sentiment analysis. He was using the following benchmark: http://www.dialog-21.ru/evaluation/2016/sentiment/

Sure. I’m not currently the best on this one.

But I actually did some experiments on

and it turned out to be an easy task to classify; I got near 0.98 F1. I checked my solution twice for bugs, but couldn’t find anything. So I believe ruSentEval is beaten.

I created a thread for it: ULMFiT - Russian

I just posted about a German topic classification dataset to the “ULMFiT - German” page. The dataset might be useful for some of you here.

I’m very happy to share the Dutch dataset and the weights of the trained language model!

However, for the former, I’m not sure whether I can publish it, since the contents are scraped. I’ve read the website’s disclaimer and there’s nothing in it that forbids it, but I’m still not quite sure about legal issues :sweat_smile:

For the language model weights, do you have any pointers on how I can best package and describe them? I’ve never shared network weights before.
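
The most basic thing I can think of is something like the sketch below (assuming fastai v1; learn is a trained language_model_learner built on data_lm, and the file names are just placeholders, not an established recipe):

import pickle

learn.save('nl_lm')              # full LM weights, written to <data path>/models/nl_lm.pth
learn.save_encoder('nl_lm_enc')  # encoder only, loadable later via load_encoder()
with open('nl_itos.pkl', 'wb') as f:
    pickle.dump(data_lm.vocab.itos, f)  # the itos vocab list is needed to reuse the weights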

I believe that in most jurisdictions scraping is allowed, unless it is explicitly restricted by the TOS of the site.

@jeremy That sounds promising!

I still wanted to give credits to the owners of the website and its reviewers, so I’ve sent them an email to see if they’re open for this kind of publication. I’ve already tidied up my code and the dataset itself, so it’s ready for publication. I’m just waiting for a response from them now… Fingers crossed! :crossed_fingers:

Please take note of this breaking change coming in v1.0.43

@sgugger: thanks for the heads-up!

@piotr.czapla: considering recent developments in the FastAI library, what’s the status of ulmfit-multilingual? I’ve used slightly edited versions of those scripts to generate a language model for Dutch. Is this still the way to go?

@benjaminvdb, good work, and just in time so we can add it to the paper that summarises multilingual ULMFiT.

Recently @mkardas made ulmfit-multilingual compatible with the latest fastai; we have it in a separate branch that I’m looking into merging into master tomorrow.

Gals & Guys we are summarising the work on ulmfit-multilingual, and @sebastianruder is writing a short paper.
I’m going through the forum threads and I’m adding ppl that contributed some interesting tests to a private thread, but there is a chance I’ve missed something. If you feel you should be added please let me know.

You can check whether you are in the thread by following this link. If I missed you, please let me know via private message, or @-mention me in your language’s thread that shows the results.

The paper has a deadline soon (end of February). If you would like your results to be added, we need a pull request by next week that includes:

  • the code to download your dataset
  • the “.md” with the results
  • and links to pretrained LM weights.

This way we can run the classification task to double-check that the results are reproducible without needing to train the full LM. If you happen to contribute to a language covered by MLDoc, the LM weights won’t be necessary, as we will have both LSTM- and QRNN-based ones.

Just to chip in: everyone who contributed a dataset and achieved good results that we can reproduce on the data will be included in the paper. Please additionally provide the following information in the PR so that it will be easier to attribute the work:

  • your first name and last name;
  • your preferred affiliation;
  • your email address.

Thanks for your efforts!

Finalizing results for Thai to use fastai 1.0.38 and above (previous results and datasets are based on 1.0.22). Do you have a hard deadline for the pull request? @sebastianruder

Edit: sorry I think I saw this 12 days too late :frowning:

Hi all!

I’m a data scientist at KPN, an NLP practitioner, and the leader of mlcourse.ai.
Currently I’m doing SSL in parallel on English (Amazon product reviews) and Dutch texts. I have tried a lot of models and approaches, and currently ULMFiT works best for the English reviews.

@benjaminvdb thanks a lot for sharing the pretrained Dutch model!
I only regret that I found this thread so late. If any contribution from my side is needed, I’m open to it.

Hello)
As far as I know, the fastai library is open to any kind of useful contribution. See the link.
Best regards))

Hi, all!

Some months ago I trained a ULMFiT model for Portuguese, but that was pre-fastai-v1. I want to give it another try with v1 and in the context of multilingual ULMFiT. I am looking for some guidance on which corpus to start with.

I noticed while doing it the first time that the English Wikitext-103 text quality is much better than that of the pt.wikipedia articles I got from the dump. The reason is that Wikitext-103 is built from featured pages. There is no specific dump for featured Wikipedia pages in Portuguese, and there are fewer featured pt pages anyway.

I wonder how you have chosen the input data. Any guidance on this issue?

@fredguth: It’s true that regular Wikipedia dumps are not as clean as Wikitext-103. However, I’ve trained a few language models on the Dutch Wikipedia and they performed pretty well, i.e., I was able to train good classifiers on very small datasets.

For example, I achieved 88% accuracy on a binary classification problem (sentiment with two polarities: neutral and positive) with only 250 examples. This suggests that the LM captured the basics of the language pretty well, so the classifier didn’t have to learn them from the target dataset.
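
To make that concrete, here is a minimal sketch of the classifier fine-tuning step with fastai v1 (the >= 1.0.43 API). It assumes the pretrained LM encoder was saved as 'lm_encoder' and that its vocab is available as vocab; the CSV path, file names and hyperparameters are placeholders, not my actual Dutch setup.

from fastai.text import *   # fastai v1

# Build the classification DataBunch, reusing the LM's vocab so the encoder weights line up.
data_clas = TextClasDataBunch.from_csv('data/reviews', 'train.csv', vocab=vocab, bs=32)

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('lm_encoder')   # transfer the pretrained LM encoder

# ULMFiT-style gradual unfreezing with discriminative learning rates.
learn.fit_one_cycle(1, 2e-2)                            # train the new head only
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))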

The question is whether you really need curated data for training your LM. It might be that instead of spending time on this, it would be more productive to use more data, e.g., CommonCrawl. I don’t know the answer to this question though.

I am trying to train the LM, but I can’t make it work. Preprocessing was OK. Can someone point out what I am doing wrong? I have tried removing --bs=40 and it still doesn’t work.

~/Code/ulmfit-multilingual pt* 1m 8s
❯ python -m ulmfit lm --dataset-path data/wiki/pt-100 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --bs=40 --cuda-id=0 - train 20 --drop-mult=0.9
Traceback (most recent call last):
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/fredguth/Code/ulmfit-multilingual/ulmfit/__main__.py", line 121, in <module>
    fire.Fire(ULMFiT())
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/fredguth/Code/ulmfit-multilingual/ulmfit/__main__.py", line 38, in lm
    params = LMHyperParams(**changes)
TypeError: __init__() got an unexpected keyword argument 'bs'
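
Edit: comparing with the Russian command earlier in this thread, the flags placed before the “-” seem to be passed to the LMHyperParams constructor, which apparently does not accept bs. Moving --bs=40 after “- train 20” should at least avoid this particular TypeError (not tested):

python -m ulmfit lm --dataset-path data/wiki/pt-100 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --cuda-id=0 - train 20 --bs=40 --drop-mult=0.9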