Multilingual ULMFiT

(Piotr Czapla) #81

Yep, I think the option was removed in favour of bidir training, but feel free to add it.

It was URLs.WT103_1.

It is really good to understand every step of your pipeline (I've learned that the hard way), but weights are often published without the exact steps to reproduce them. Think about it: do you know how the PyTorch ResNet weights were produced?

Why not, I’d like to have a way to retrain them if necessary.

I haven't tried it, but it should work. Try it. Comparing it with training on a merged Mexican-Spanish and Spanish wikipedia would be quite interesting.

(Piotr Czapla) #82

It looks like an interesting way to speed up training and get better accuracy; I would love to see it in action.

I didn't quite follow you. You need the text joined in LanguageModelLoader to create batches.

I would be interested to see how this helps

That goes against the idea of quick finetuning.

That was on a server that was destroyed, and I don't have the weights. I do have some other pretrained models with weights, but I need to check how good they are.

(Tomasz Pietruszka) #83

@piotr.czapla thanks for your feedback

Well, I think concatenating all documents is not really necessary, just easier (and more efficient computationally).

I would imagine taking a bucket of similar-length sequences, bptt-ing through them from their beginning to end (using PackedSequences to avoid padding), and then taking a next bucket of sequences.
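To make the bucketing idea concrete, here is a minimal pure-Python sketch. The names `bucket_by_length` and `bptt_windows`, and the `bucket_width`, `batch_size` and `bptt` values, are made up for illustration and are not part of fastai or the ulmfit scripts. Each batch is sorted longest-first, which is the order `torch.nn.utils.rnn.pack_sequence` expects by default; the packing itself and the shifted LM targets are left out.

```python
from collections import defaultdict


def bucket_by_length(docs, bucket_width=10, batch_size=4):
    """Group tokenized documents into batches of similar length.

    docs is a list of token lists. Documents whose lengths fall in the
    same bucket_width-wide range share a bucket (hypothetical helper).
    """
    buckets = defaultdict(list)
    for doc in docs:
        buckets[len(doc) // bucket_width].append(doc)

    batches = []
    for key in sorted(buckets):
        # Longest first, so a batch could later be packed without padding.
        bucket = sorted(buckets[key], key=len, reverse=True)
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches


def bptt_windows(batch, bptt=5):
    """Yield successive bptt-long slices of every document in a batch.

    Documents that have already ended are dropped from later windows,
    so no padding tokens are ever produced.
    """
    longest = max(len(doc) for doc in batch)
    for start in range(0, longest, bptt):
        yield [doc[start:start + bptt] for doc in batch if len(doc) > start]
```

In a real training loop the hidden state would be carried across the windows of one batch and reset when moving on to the next bucket.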

Well, what I meant was:

  1. take the general LM (trained on wikitext)
  2. fine-tune the LM on the downstream task’s dataset
  3. add the classifier head to the LM, without removing the old one, train both further.

The hypothesis: on a small dataset of short text, fine-tuning the classifier might make the encoder “forget” the language. It also might not, and I guess it is not that easy to test.
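One way to read step 3 is as a joint objective: the classifier loss plus a weighted language-model loss computed from the same encoder. Below is a rough sketch of that objective in plain Python, only to pin down what "train both" could mean. The function names and the `lm_weight` value are assumptions for illustration; in practice this would be written with the framework's own loss functions.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(cls_logits, cls_target, lm_logits, lm_targets, lm_weight=0.5):
    """Classifier loss plus a weighted LM loss (hypothetical weighting).

    cls_logits: scores from the classifier head for one document.
    lm_logits:  one list of vocabulary scores per token position,
                produced by the LM head that was kept on the encoder.
    """
    cls_loss = cross_entropy(cls_logits, cls_target)
    lm_loss = sum(
        cross_entropy(step_logits, target)
        for step_logits, target in zip(lm_logits, lm_targets)
    ) / len(lm_targets)
    return cls_loss + lm_weight * lm_loss
```

Setting `lm_weight=0` recovers plain classifier fine-tuning, so comparing the two settings on the same data would be one way to probe the "forgetting" hypothesis.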

(Gaurav) #84

I am using pip for installing & all my packages are as suggested in this page.

Please help me to resolve the issue.


(Alexey) #85

Hello guys) First of all, I would like to thank you for all the hard work you are doing to make it easier for those of us following behind.
To @piotr.czapla
I'm working on training a Russian language model and I have a couple of questions.
My LM training steps are:
exp = LMHyperParams(dataset_path='data/wiki/ru-100/', qrnn=False, tokenizer='v', lang='ru', name='russian')
learn = exp.create_lm_learner(data_lm=data_lm)
learn.fit_one_cycle(20, 1e-3, moms=(0.8,0.7))

I would like to test the model on a downstream task (text classification) and use the functionality of ulmfit.train_clas. Now for the questions:

  1. How can I use the trained LM in a classification task? My suggestion is something like this:
    exp = CLSHyperParams('data/my_class_task_data')
    exp.pretrained_fnames = path.to_my_best_model
    Is my suggestion right, or am I missing something?
  2. If I use a task other than IMDb for classification and want to apply CLSHyperParams, do I need to lay out my_task's data directory the same way as the IMDb data?
  3. What is a good choice of vocabulary size for training an LM on wiki from scratch?

Any advice would be greatly appreciated))


Hey. My dataset is a mixture of French and English, and I have a classification problem. Can you give me some advice on using ULMFiT? Should I train a new LM on a mixed French and English wiki? Thanks

(Benjamin van der Burgh) #87

Hi Grigor!

Is the text within a document in your dataset multi-language, or is each document in only one of multiple languages? If the latter is the case, you can use something like the langdetect Python library. Run detect_langs() to get the probabilities for each language, and remove the document if the probability for your language is below a given threshold.

In my case I have a dataset of book reviews that are mostly in Dutch. Some reviewers have translated the Dutch review into other languages, so I use langdetect to remove these multi-language documents from the dataset.
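Roughly, the filtering looks like the sketch below. It assumes langdetect's `detect_langs`, which returns candidates that each have a `.lang` code and a `.prob`. The helper name `is_mostly_language`, the 0.9 threshold and the optional `detect` argument (handy for swapping in another detector) are my own choices for illustration.

```python
def is_mostly_language(text, lang, threshold=0.9, detect=None):
    """Return True if `lang` is detected in `text` with prob >= threshold.

    `detect` is any callable returning objects with `.lang` and `.prob`;
    by default langdetect's detect_langs is used (pip install langdetect).
    """
    if detect is None:
        from langdetect import detect_langs  # third-party dependency
        detect = detect_langs
    try:
        candidates = detect(text)
    except Exception:
        # langdetect raises on text with no usable features (e.g. empty
        # strings or digits only); treat those documents as rejects.
        return False
    return any(c.lang == lang and c.prob >= threshold for c in candidates)
```

Usage would be something like `reviews = [r for r in reviews if is_mostly_language(r, "nl")]`. Note that langdetect is non-deterministic unless you set `DetectorFactory.seed`, so results on borderline documents can vary between runs.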

I hope this helps!


Hi, thanks for the response.
I tried langdetect, but as my dataset consists of descriptions which are not very long, it reported the 'wrong' languages with very high probability.
So, given that I can't reliably detect the language, I am thinking of training a new model from scratch on French and English wiki, and I would like to get some advice.