Gals & Guys we are summarising the work on ulmfit-multilingual, and @sebastianruder is writing a short paper.
I’m going through the forum threads and I’m adding ppl that contributed some interesting tests to a private thread, but there is a chance I’ve missed something. If you feel you should be added please let me know.
You can check if you are in the thread by following this link. If I missed you, please let me know by private message, or @-mention me in your language's thread that shows the results.
The paper has a deadline soon (end of Feb). If you would like your results to be added we need to have a pull request by next week that includes:
the code to download your dataset
the “.md” with the results
and links to pretrained LM weights.
That way we can run the classification task to double-check that the results are reproducible without needing to train the full LM. If you happen to contribute to a language covered by MLDoc, the LMs won’t be necessary, as we will have both LSTM- and QRNN-based ones.
Just to chip in, everyone who contributed a dataset and achieved good results that we can reproduce on the data will be included in the paper. Please additionally provide the following information in the PR so that it will be easier to attribute the work:
Finalizing results for Thai to use fastai 1.0.38 and above (previous results and datasets are based on 1.0.22). Do you have a hard deadline for the pull request? @sebastianruder
I’m DS @ KPN, NLP practitioner and leader of mlcourse.ai.
Currently doing SSL in parallel with English (amazon product reviews) and Dutch texts. Have tried a lot of models and approaches, and currently ULMFiT works best for reviews in English.
@benjaminvdb thanks a lot for sharing the pretrained Dutch model!
I only regret that I found this thread so late. If any contribution from my side is needed, I’m open.
Some months ago I trained a ULMFiT model for Portuguese, but this was pre-fastai v1. I want to give it a try again with v1 and in the context of the multilingual ULMFiT. I am looking for some guidance on the corpus to start with.
I noticed the first time around that the text quality of the English Wikitext-103 is much better than that of the pt.wikipedia articles I got from the dump. The reason is that Wikitext-103 is drawn from the featured pages. There is no specific dump of featured Wikipedia pages in Portuguese. Moreover, there are fewer featured pt pages anyway.
I wonder how you have chosen the input data. Any guidance on this issue?
@fredguth: It’s true that regular Wikipedia dumps are not as clean as Wikitext-103. However, I’ve trained a few language models on the Dutch Wikipedia and they performed pretty well, i.e., I was able to train good classifiers on very small datasets.
For example, I achieved 88% accuracy on a binary classification problem (sentiment with two polarities: neutral and positive) on only 250 examples. This can only mean that the LM was able to catch language basics pretty well and it didn’t have to learn them from the target dataset.
The question is whether you really need curated data for training your LM. It might be that instead of spending time on this, it would be more productive to use more data, e.g., CommonCrawl. I don’t know the answer to this question though.
I am trying to train the LM, but I can’t make it work. Preprocessing was OK. Can someone point out what I am doing wrong? I have tried removing --bs=40 and it still does not work.
~/Code/ulmfit-multilingual pt* 1m 8s
❯ python -m ulmfit lm --dataset-path data/wiki/pt-100 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --bs=40 --cuda-id=0 - train 20 --drop-mult=0.9
Traceback (most recent call last):
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/fredguth/Code/ulmfit-multilingual/ulmfit/__main__.py", line 121, in <module>
    fire.Fire(ULMFiT())
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/fredguth/anaconda3/envs/fastai/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "/home/fredguth/Code/ulmfit-multilingual/ulmfit/__main__.py", line 38, in lm
    params = LMHyperParams(**changes)
TypeError: __init__() got an unexpected keyword argument 'bs'
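The last frame of the traceback shows the command-line flags being forwarded as keyword arguments (`LMHyperParams(**changes)`), so a flag that the class does not declare would fail in this way. Below is a minimal sketch of that mechanism. It assumes a dataclass-style parameter class; `DemoHyperParams`, its fields, and `split_known` are hypothetical stand-ins for illustration, not the real `LMHyperParams` from the repository.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-in: the real LMHyperParams fields may differ.
@dataclass
class DemoHyperParams:
    dataset_path: str = "data/wiki/pt-100"
    qrnn: bool = False
    name: str = "demo"


def split_known(cls, changes):
    """Separate the kwargs a dataclass declares from the ones it would reject."""
    accepted = {f.name for f in fields(cls)}
    known = {k: v for k, v in changes.items() if k in accepted}
    unknown = sorted(k for k in changes if k not in accepted)
    return known, unknown


# Flags as they might arrive from the command line.
changes = {"name": "bs40", "bs": 40}

try:
    DemoHyperParams(**changes)
    error = None
except TypeError as exc:
    # Same kind of failure as in the traceback: 'bs' is not a declared field.
    error = str(exc)

known, unknown = split_known(DemoHyperParams, changes)
params = DemoHyperParams(**known)
print("rejected flags:", unknown)
```

Listing the declared fields of the real class in the same way would show which flag names the current branch accepts.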
@piotr.czapla I noticed that this branch has been active for some time now. Is there a deadline for merging with the fast.ai master branch? Also, I understood that one of the goals of this thread was to use these pretrained models in a conference submission whose deadline has now passed. Is there still a reason to keep this as a branch? Is there some goal we are trying to reach?
We posted the paper and are waiting for the review. I haven’t yet managed to merge all the code and update it to the newest fast.ai, as we had another conference where we were using ULMFiT that we learned about 2 weeks before the deadline, so we had a tough time.
Seems we are on the same track. Did you manage to deploy a model trained with ULMFiT using torch.jit? If yes, it would be great if you could share your experience.
Dumb question: if I understood correctly, this thread is about creating a ULMFiT model per language. Are there any plans to create a single model for multiple languages (like BERT)?
I have one question regarding those two topics. They have the same agenda, right? The agenda being creating and using ULMFiT for other (not yet implemented) languages. Or does Multilingual in the title refer to a language model that can learn multiple languages (I get a different impression, hence the question)?
Hello Marin. In my opinion this thread is more about the former than the latter. Here we attempt to use ULMFiT for different NLP domains in different languages, and in most cases with great success. Still, ideas on a model to “rule them all” are more than welcome))
{LANG}-2 for small wiki with max number of tokens in each train:valid:test = 2000k:200k:200k (10:1:1)
{LANG}-100 for large wiki with max number of tokens in each train:valid:test = 98000k:200k:200k (490:1:1)
{LANG}-all for all wiki with max number of tokens in each train:valid:test = all:200k:200k (>490:1:1)
Is it on purpose that the number of tokens (which in the end determines the number of articles) in valid and test is the same for the small, large and all wiki? In my case, with the Indonesian wiki, I get 446632 articles in the training dataset and only 256 articles in the valid or test dataset for the large wiki. I think the valid and test datasets are just too small compared to the training dataset in the large wiki (and it could be worse in the all wiki).
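The proportions in question can be checked with a few lines of arithmetic. This is only a sketch over the numbers quoted above (token budgets in thousands, and the Indonesian article counts); the variable and function names are made up for illustration. By this arithmetic the large budget works out to 490:1:1.

```python
# Token budgets quoted above, in thousands of tokens: (train, valid, test).
budgets = {
    "{LANG}-2": (2_000, 200, 200),
    "{LANG}-100": (98_000, 200, 200),
}


def ratio(train, valid, test):
    """Express a split relative to the size of the validation budget."""
    return (train / valid, valid / valid, test / valid)


ratios = {name: ratio(*split) for name, split in budgets.items()}

# Article counts reported for the Indonesian large wiki.
train_articles, valid_articles = 446_632, 256
valid_to_train = valid_articles / train_articles

for name, r in ratios.items():
    print(name, ":".join(f"{x:g}" for x in r))
print(f"valid articles per train article: {valid_to_train:.6f}")
```

Because the valid and test budgets are fixed at 200k tokens, their share shrinks as the training set grows, which is what the article counts reflect.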
Hi, I need to know the training time of the LM, because I need to schedule a GPU on the server. I’m training an LM for Croatian.
I mainly follow the notebooks by @duxan and @ademyanchuk for training LMs (they have also printed training times), so maybe you two would know most precisely. The GPU is a GeForce GTX 1080 (is this maybe too weak a GPU for this?). What GPUs did you two use (or anybody else who trained LMs for their respective languages)?
Hi.
I trained the LM on a GTX 1080 Ti. I used nearly 7 GB of GPU memory. It took ~26 hours to train to a perplexity of ~27. I hope this information helps and that I understood your question correctly.
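For anyone comparing this number with the loss column in their own notebook: perplexity is the exponential of the mean per-token cross-entropy. The sketch below assumes the loss is reported in nats (natural log), as PyTorch's cross-entropy does; the function names are illustrative only.

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)


def loss_for(ppl):
    """Inverse mapping: the cross-entropy that corresponds to a perplexity."""
    return math.log(ppl)


# A perplexity of ~27 corresponds to a loss of about 3.30 nats per token.
target_loss = loss_for(27)
print(f"loss for perplexity 27: {target_loss:.2f}")
```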
Could be, but perplexity is a very poor estimate of the accuracy on the downstream task. It is there only to make sure the model is being trained, so I wouldn’t care too much about the validation set: the smaller it is, the faster the training :).