Multilingual ULMFiT

Are there any new plans incorporating the new LASER models (

We haven’t had as much success with QRNN as I was hoping either…

Thank you very much for your clarification. Nevertheless, I don’t know if this is a good move for people who want to bring fastai models into production, since ONNX export doesn’t seem to work for now with the batch_first option set to true. Several issues have been raised on the PyTorch GitHub.

@pjetro I hadn’t noticed your question before; here are some answers:

Hi Tomek, I didn’t get your question. It is meant to be used on different languages, with or without spaCy; the point is that we are going to run some experiments to see what gives us the best results. Sebastian initially wanted to keep Moses tokenization and SentencePiece, as he thought they would get us better accuracy on local languages. I’ve added spaCy so we can compare the performance.

Yeah, we are experimenting a bit to see how ULMFiT works on XNLI and whether it can be improved using tricks from papers like ELMo. I’ve started this repo so we can do that without disturbing the course, and once we have things that work we can contribute them to fastai (hence we have the fastai_contrib package).

What is important to me is that we can replicate the fantastic classification results on other languages using one repository and maybe we can tackle XNLI as it is essentially a more advanced classification.

What tweaks do you have in mind?

I think we are very close to the paper results; my models on IMDb get the following error rates:

  • 5.4% SentencePiece and a 4-layer LSTM
  • 5.1% Moses + fastai preprocessing
  • 5.2% fastai tokenization (starting from wt103)

And the performance reported in the paper is 5.0 to 5.3 on a single model, see the quote below:

Impact of bidirectionality: At the cost of training a second model, ensembling the predictions of a forward and backwards LM-classifier brings a performance boost of around 0.5–0.7. On IMDb we lower the test error from 5.30 of a single model to 4.58 for the bidirectional model.
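
The bidirectional result quoted above combines two separately trained classifiers. A minimal sketch of that ensembling step, assuming the combination is a plain average of the two models’ class probabilities; the function names and the toy numbers are made up for illustration and are not taken from fastai or from the paper’s code:

```python
from typing import List, Sequence


def ensemble_average(fwd_probs: Sequence[Sequence[float]],
                     bwd_probs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Average per-class probabilities of a forward and a backward classifier."""
    if len(fwd_probs) != len(bwd_probs):
        raise ValueError("both models must score the same examples")
    return [[(f + b) / 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(fwd_probs, bwd_probs)]


def error_rate(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of examples whose highest-probability class differs from the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in probs]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


# Toy predictions (hypothetical numbers: two classes, three reviews).
labels = [0, 1, 1]
fwd = [[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]]  # forward model misses the 3rd review
bwd = [[0.80, 0.20], [0.30, 0.70], [0.30, 0.70]]  # backward model gets it right
ens = ensemble_average(fwd, bwd)

print("forward error :", error_rate(fwd, labels))
print("ensemble error:", error_rate(ens, labels))
```

In this toy case the backward model’s confident prediction outweighs the forward model’s narrow mistake, which is the kind of effect the ensemble relies on.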

Sure, we keep this repo separate so that more ppl can contribute without making it harder for Sylvain and Jeremy to manage. That said, I was hoping it would be used mostly to get the multilingual part polished. How about we get on a chat and discuss?

@claeyzre how about we try to get torch.jit to work for production code instead of ONNX?

Finally, I pushed my code for Japanese ULMFiT to my GitHub repository.
It’s still a mess and I’ll clean it up in the near future.
Let me know if you have any questions.

Piotr, thank you for your answer. Sorry for the delay, just got back from a long Xmas break.

The first part of my post was not really a question, more of a sanity check about what that repo is meant to be.

I want to focus on tasks with a small dataset for the downstream task (+ short documents)

  1. for a start, I wanted to get rid of the sequence padding in the classifier, by using PackedSequences. I know there was a failed attempt, but I think it could work and be beneficial. In progress now.
  2. in LM fine-tuning, the current approach of just concatenating everything into 1 giant text might not be optimal for short sequences
  3. when fine-tuning on a small dataset, we effectively throw away all embeddings that are not present in the downstream task’s training set. We could keep them, up to some number (the 30k/60k typical limit) and have fewer UNK tokens in test. I think it would be interesting to see if those non-fine-tuned embedding values stay somewhat relevant
  4. I was also thinking about jointly training the classifier and the language model (for the downstream task) as a way of sort-of regularization. I guess this is a long shot, and not easy to implement.
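
To make point 3 concrete, here is a minimal sketch of a vocabulary that keeps pretrained tokens after the task’s own tokens, up to a size limit. It only covers the vocabulary side (which tokens get an id), not the embedding matrix itself. The function names, the special tokens and the toy data are assumptions made up for this illustration:

```python
from collections import Counter
from typing import Iterable, List

SPECIAL_TOKENS = ["xxunk", "xxpad"]  # hypothetical special tokens for this sketch


def build_vocab(task_tokens: Iterable[str], pretrained_itos: Iterable[str],
                max_vocab: int = 60000) -> List[str]:
    """Task tokens first (most frequent first), then pretrained tokens fill the rest."""
    itos = list(SPECIAL_TOKENS)
    seen = set(itos)
    ordered_task = [tok for tok, _ in Counter(task_tokens).most_common()]
    for tok in ordered_task + list(pretrained_itos):
        if len(itos) >= max_vocab:
            break
        if tok not in seen:
            itos.append(tok)
            seen.add(tok)
    return itos


def unk_rate(tokens: List[str], itos: List[str]) -> float:
    """Share of tokens that would be mapped to UNK with the given vocabulary."""
    vocab = set(itos)
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy data: a tiny training set, a test sentence, and a "pretrained" vocabulary.
train_tokens = "the movie was great the end".split()
test_tokens = "the film was terrible".split()
pretrained_itos = ["the", "of", "film", "was", "terrible", "great"]

task_only = build_vocab(train_tokens, [])
extended = build_vocab(train_tokens, pretrained_itos)

print("UNK rate, task-only vocab:", unk_rate(test_tokens, task_only))
print("UNK rate, extended vocab :", unk_rate(test_tokens, extended))
```

Whether the embeddings of the extra tokens remain useful after fine-tuning is exactly the open question; the sketch only shows that fewer test tokens fall back to UNK.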

Any comments? E.g., do you already know that something from the list will not work? :slight_smile:

That’s great!

What settings/hyperparameters do you use? Could you share the commands used? And have you uploaded the trained models somewhere?

Also, what is the status on bidirectional models and classifiers?

Sure, but what chat do you mean?

BTW the minor changes I wanted to make already were mostly in README, but they might have gotten outdated now. I will check tomorrow.

Hello! Is it necessary to install fastai in developer mode, or could I use the standard installation to work on LM and classification tasks?

@piotr.czapla 2 additional questions:

  1. am I missing something, or is there currently no way to train a backwards LM (and classifier)? I can try to add it, just want a confirmation
  2. you’ve mentioned starting from wt103. You meant this wt103_v1 version, right? Or the old one, without the _v1 suffix?

BTW, not sure if this is a common view or not, but I do not really like using those pretrained models, as it is not clear to me how exactly they were trained. Clear how something was trained == able to reproduce it (input data, software versions, all parameters, etc)

Maybe for ulmfit-multilingual, when pretrained models are ready to be published, it would be good to publish them along with some detailed instructions on how to reproduce? Or perhaps Dockerfiles, to have the environment better controlled?

I believe the developer install is only needed if you are planning to either:

  1. modify the library code, or
  2. use the ulmfit-multilingual repo.

Otherwise, it should not be needed. You can either:
a) Use the course notebooks with fastai 0.7 (as mentioned in )
b) Use the imdb_scripts with fastai 1.0. Not 100% sure if that will work though

Hi all, I did some searching in the forums and online but did not find any results; apologies if this has been asked and answered before.

Has anyone tried fine-tuning twice? Would it be logical to train an LM on a large corpus (for instance, Spanish Wiki), then fine-tune that LM on a smaller corpus (Mexican-Spanish text scraped from local news articles), and finally fine-tune on your text classification dataset to predict the different classes? I’m thinking that this approach, conceptually, should help with under-resourced languages.

That is pretty much the intention of ULMFiT:
A) train on a big corpus so the model understands the language
B) fine-tune on a domain-specific task
B.1) one part without classes but from the specific domain, e.g. IMDb reviews without classes
B.2) one part with classes, e.g. IMDb reviews with classes
The idea of B.1 is to reduce the need for text with classes, as these are often more sparse.

If, however, you think of B.1 as a second large generic corpus, then I would merge it with A and train both the tokenizer and the neural net on both up front.

Your assumption of training an LM on different corpora holds as long as those corpora use the “same” vocab. In the text classification example from the lesson, the goal of fine-tuning the LM on the reviews is to let the LM better fit the language used in those very reviews. That way, the signal (input) it provides to the classification “head” will be more accurate. So it is hard to see how this approach might assist tasks over different under-resourced languages.

Thanks for the response @fabris. My aim regarding the under-resourced languages was to use a well-resourced language (e.g. Spanish) to train an LM which can be fine-tuned to a derivative or dialect (e.g. Mexican-Spanish) of this well-resourced language. The resulting LM would then hopefully have knowledge of the structure of Mexican-Spanish (which it would have learned from the Spanish model, plus the common vocab). This model could then be used to fine-tune a Mexican-Spanish classifier.

At least this is my logic, was just curious if anyone tried a similar approach and got good/bad results. If this doesn’t sound like a terrible idea to the folks on the forum then I’ll give it a bash and report on my results.

Sub: Facing issues when trying to create a language model using Amazon reviews.
I am following this link: for implementing sentiment classification using fastai.
I also studied Lesson 10 of the fastai course.
I am stuck at the code where we are trying to load the wiki language model. I tried with fastai versions 1.0 and 0.7, and torch versions 0.4.0 and 0.3.1.

Following is the error snapshot that I faced with torch version 0.3.1 and fastai version 0.7
Error Snapshot:
Traceback (most recent call last):
  File "", line 177, in
    wgts['0.encoder.weight'] = T(new_w)
TypeError: 'module' object is not callable

With torch 0.4.0 this issue is resolved, but other issues cropped up.

I think the issue is related to version compatibility between fastai and torch. Please help suggest a resolution.

Following are other details of my environment


You need to create a new conda environment and follow the instructions here:
in order to get a coherent setup with compatible libraries. Then start using the notebook.
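
When comparing an environment against the versions a notebook expects, it can help to print what is actually installed. A small sketch using only the standard library (Python 3.8+); the package list is just an example, and the helper names are made up for this illustration:

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


# Example package list; adjust it to whatever the notebook depends on.
for name, version in report_versions(["fastai", "torch", "spacy"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Posting this output along with the traceback makes version mismatches much easier to spot.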

Yep, I think the option was removed in favour of bidir training, but feel free to add it.
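
For anyone who wants to add it back, the data side of a backwards LM can be sketched as reversing each tokenized document before it is fed to the loader. This is only an assumption about how one might do it, not the code that was removed; the function name and token ids are hypothetical:

```python
from typing import List, Sequence


def reverse_documents(docs: Sequence[Sequence[int]]) -> List[List[int]]:
    """Return a copy of the corpus with every document's token ids reversed."""
    return [list(reversed(doc)) for doc in docs]


# Hypothetical token ids; 2 stands in for a beginning-of-document marker.
docs = [[2, 10, 11, 12], [2, 20, 21]]
backward_docs = reverse_documents(docs)

print(backward_docs)
```

Whether markers such as the beginning-of-document token should stay at the front or be reversed along with the text is a design choice this sketch leaves open.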

It was URLs.WT103_1.

It is really good to understand every step of your pipeline (I’ve learned it the hard way), but weights are often published without the exact steps to reproduce them. Think about it: do you know how the PyTorch ResNet weights can be reproduced?

Why not, I’d like to have a way to retrain them if necessary.

I haven’t tried it, but it should work. Try it. Comparing it with training on a merged Mexican-Spanish and Spanish Wikipedia would be quite interesting.

Looks interesting, both to speed up training and to get better accuracy. I would love to see it in action.

I didn’t get you. The texts need to be joined in LanguageModelLoader to create batches.

I would be interested to see how this helps

That goes against the idea of quick finetuning.

That was on a server that was destroyed, and I don’t have the weights. I do have some other pretrained models with weights, but I need to check how good they are.

@piotr.czapla thanks for your feedback

Well, I think concatenating all documents is not really necessary, just easier (and more efficient computationally).

I would imagine taking a bucket of similar-length sequences, bptt-ing through them from beginning to end (using PackedSequences to avoid padding), and then taking the next bucket of sequences.
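
The bucketing step can be sketched without any deep learning library. The code below groups documents by length and counts how many padding positions each grouping would need, as a rough measure of how uneven the buckets are. It does not perform BPTT or use PackedSequence, and all names and lengths are invented for the illustration:

```python
from typing import List, Sequence

Doc = Sequence[int]


def chunk_in_order(docs: Sequence[Doc], bucket_size: int) -> List[List[Doc]]:
    """Group documents into buckets in their original order."""
    return [list(docs[i:i + bucket_size]) for i in range(0, len(docs), bucket_size)]


def bucket_by_length(docs: Sequence[Doc], bucket_size: int) -> List[List[Doc]]:
    """Sort documents by length first, so each bucket holds similar-length ones."""
    return chunk_in_order(sorted(docs, key=len), bucket_size)


def padding_needed(buckets: Sequence[Sequence[Doc]]) -> int:
    """Total pad positions if every bucket were padded to its longest document."""
    total = 0
    for bucket in buckets:
        longest = max(len(doc) for doc in bucket)
        total += sum(longest - len(doc) for doc in bucket)
    return total


# Toy corpus: six documents with hypothetical lengths, filled with a dummy token id.
docs = [[1] * n for n in (2, 9, 3, 8, 2, 10)]

print("padding, original order  :", padding_needed(chunk_in_order(docs, 3)))
print("padding, length-bucketed :", padding_needed(bucket_by_length(docs, 3)))
```

With packed sequences the padding itself disappears, but similar-length buckets still keep the number of active sequences per time step more even.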

Well, what I meant was:

  1. take the general LM (trained on wikitext)
  2. fine-tune the LM on the downstream task’s dataset
  3. add the classifier head to the LM, without removing the old one, train both further.

The hypothesis: on a small dataset of short text, fine-tuning the classifier might make the encoder “forget” the language. It also might not, and I guess it is not that easy to test.
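
One way to write down the joint objective from step 3, as an assumption rather than something tested in this thread. The weight λ and the symbols are introduced here only for illustration: θ denotes all trainable parameters (encoder plus both heads), and the two terms are the classification loss and the language-model loss on the downstream data:

```latex
\mathcal{L}_{\text{joint}}(\theta)
  = \mathcal{L}_{\text{clf}}(\theta)
  + \lambda \, \mathcal{L}_{\text{LM}}(\theta),
  \qquad \lambda \ge 0
```

With λ = 0 this reduces to ordinary classifier fine-tuning; a larger λ keeps pressure on the shared encoder to keep predicting the next token, which is the regularizing effect the hypothesis relies on.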

I am using pip for installation, and all my packages are as suggested on this page.

Please help me to resolve the issue.