ULMFiT - French

I haven’t sent them a request, since @claeyzre found the data and we can already train and see. But indeed, it would be good to double-check with them that they are okay with us using their data.
You have superb results, provided we haven’t made a mistake. F1 is tricky, as there are different, incompatible implementations of F1 micro. For example, scikit-learn calculates F1 differently from how it is described on the Wikipedia pages and, what is worse, the results differ a lot.

For German, I calculated F1 by hand, using the data from the paper to reverse-engineer the formula used in the competition, and then implemented an F1 calculation for my scripts using numpy.
I think we can do the same for the DEFT paper. 0.5 would be an amazing result.

Please share the code. How about integrating it into ulmfit-multilingual?


Ok, I will contact the guys from the competition to try to get their approval and the official data/results.

For F1 macro (which seems to be the competition metric, rather than F1 micro), I used two different implementations: an sklearn-based one and a custom one I coded using the Wikipedia formulas. As you said, the results are not identical (I think this is because, when a class is never predicted, sklearn uses an F-score of 0 for it, which lowers the macro F1), but they are very close: 0.54 is the sklearn result, and my custom implementation gives a slightly better score.
This is the sklearn-based implementation that gave me 0.54:

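  from sklearn.metrics import f1_score
  from torch import Tensor

  def f1_sklearn(y_pred: Tensor, y_true: Tensor):
    # y_pred holds one score per class; keep the index of the best class
    y_pred = y_pred.max(1)[1]
    # Macro F1: unweighted mean of the per-class F1 scores
    res = f1_score(y_true, y_pred, average='macro')
    return Tensor([res]).float().squeeze()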

I should be able to share the full code next Monday.
Yes, it’s a good idea to integrate it into ulmfit-multilingual, as it’s used a lot. There is already an fbeta metric in fastai, but it does not handle multiclass (with macro or micro score).
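
A minimal sketch, in plain Python, of how two macro F1 variants like the ones discussed above can diverge. It assumes the only difference is how a never-predicted class is handled: scored 0 (the sklearn behaviour described in this post) or left out of the average. The function names and the `skip_unpredicted` flag are hypothetical, and this is not the custom implementation mentioned above.

```python
def per_class_f1(y_true, y_pred, n_classes):
    """Return one F1 score per class, or None when the class is never predicted."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if p != c and t == c)
        if tp + fp == 0:
            # Precision is 0/0 here: the class was never predicted.
            scores.append(None)
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return scores


def f1_macro(y_true, y_pred, n_classes, skip_unpredicted=False):
    """Average the per-class F1 scores, with two policies for undefined ones."""
    scores = per_class_f1(y_true, y_pred, n_classes)
    if skip_unpredicted:
        kept = [s for s in scores if s is not None]
    else:
        kept = [0.0 if s is None else s for s in scores]
    return sum(kept) / len(kept) if kept else 0.0


# Toy labels (not competition data): class 2 is never predicted.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 0, 1]
strict = f1_macro(y_true, y_pred, n_classes=3)
lenient = f1_macro(y_true, y_pred, n_classes=3, skip_unpredicted=True)
print(round(strict, 4), round(lenient, 4))
```

On this toy example the "score 0" policy gives a lower macro F1 than the "skip" policy, so when comparing with published results it matters which convention the competition used.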


I have pushed the movie review classifier notebook to GitHub, along with the weights/vocab of the French LM.
You can download it from here: https://github.com/tchambon/deepfrench

Still waiting for an answer from the DEFT competition people to confirm the very good results (new SOTA) on the 4-label tweet classification.


Thank you for sharing! :blush:

Good Work!

Could you give us more information about the IMDb-like French movie review dataset?

The website is called Allocine; I collected the data by web scraping.



I am trying to adapt the imdb notebook (fastai v0.7) to use the pretrained model, but I get this error:

KeyError: 'unexpected key "0.encoder_dp.emb.weight" in state_dict'

With fastai v1, on the other hand, I also get an error.

Thanks for any hint.


Hi man,
I had a look at your GitHub; congratulations on a really great job.
Just one question: what does the itosref30.pkl file contain? It should have only the data_lm.vocab.itos, where data_lm is the ‘general’ French language model, right?

Hi Mr @tomsthom, I’m working on unlabeled French tweets and, as I’m new to NLP, I wonder how I can use your model to label them. If you can give me some advice, please send me your email address. Thanks.

Hi @claeyzre, your link to the tweets dataset in French does not work anymore. Can you share your dataset through another link? Thank you.

Hi @piotr.czapla and @tomsthom: did you get an official answer from https://deft.limsi.fr/2017/ about the tweets dataset in French? Thank you.

@tomsthom: could you share your film reviews dataset in French in order to make my own tests and share results? Thank you.

Hi, thanks to Rachel’s new fastai NLP course, the Jupyter notebook nn-vietnamese.ipynb is now online, and it allows anybody to train a Language Model from scratch. I’m testing it to create a French LM (FLM) and I will publish a post about this.

So now, the question is the use of such a FLM (for classification tasks for example in a specific domain = ULMFiT)… and to be able to benchmark the results!

For that, we need public datasets in French. Clearly, we could start with datasets for sentiment classification. Could we share the links to them here (if any)?

I don’t remember if I got a response, but since then I’ve got access to MLDoc & CLS (https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which include FR as a language, and we’ve got pretty good results using both the LSTM and QRNN versions of ULMFiT. We have a paper & blog post coming out soon. If you want to give us a hand, you could retrain the QRNN and LSTM versions of the language model using the latest hyperparameters from fastai. The language models trained for the paper use a slightly buggy SentencePiece setup and old hyperparameters.

I’ve also rewritten our whole experimentation framework to use the current fastai. It is basically a high-level API on top of fastai. Here is an example showing how to train the JA version (https://github.com/n-waves/ulmfit-multilingual/blob/multifit/notebooks/JA-multifit_fp16.ipynb).
Although running just a few commands would be helpful, it is going to be boring and not worth writing a blog post about.

What could be interesting is to investigate how much better our model gets if we pretrain the LM on a corpus closer to the domain than Wikipedia. It gave us a huge boost in performance for Polish hate speech detection. We pretrained the LM on Polish Reddit and fine-tuned it on the hate speech dataset, which consists of tweets.

Great! I will retrain the French model using this.
I have also worked on tweets, so I will try to check whether using a corpus like Reddit is better than Wikipedia.
@piotr.czapla did you pretrain on Wikipedia, then on Reddit, then on your hate speech dataset? Or did you only train on Reddit and then fine-tune on your hate speech dataset?

The movie review dataset was web-scraped from a famous French movie review website.
But doing this is against the site’s ToS, so I am not allowed to share the dataset (I have asked them but didn’t get any reply…).

Thanks for the link. Do you know if we can upload the French Amazon Reviews file to Kaggle and then launch a sentiment classification competition, in order to verify that the ULMFiT method gives SOTA results?

Guide to download the French Amazon Customer Reviews

Read the information page and license of the Amazon Customer Reviews Dataset.

  1. Create an AWS Free Tier account.

  2. Log in to the IAM console of your AWS account with the login/password from step 1.

  3. Create an IAM Admin User and Group by following these rules.

  4. Create your IAM user access keys (access key ID and secret access key) by following these rules. DO NOT FORGET to save your 2 keys.

  5. Install the AWS Command Line Interface (aws cli) in an Ubuntu terminal on your computer by following these rules.

  6. Configure your aws cli by following these rules.

  7. With your aws cli, you can list the available review datasets in the bucket with the ls command by typing the following in your Ubuntu terminal:
    aws s3 ls s3://amazon-reviews-pds/tsv/

    List (2017-11-24):

  8. To download data using the aws cli, you can use the cp command. For instance, the following commands copy the file named amazon_reviews_multilingual_FR_v1_00.tsv.gz to your local data folder:
    cd path_to_your_data_folder
    aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_multilingual_FR_v1_00.tsv.gz .

  9. Unzip your file:
    gzip -d amazon_reviews_multilingual_FR_v1_00.tsv.gz

  10. In your Jupyter notebook, open your tsv file with pandas, for example with the following code (see the list of column names):

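    import pandas as pd
    from pathlib import Path

    path_data = Path('path_to_your_data_folder')  # the folder used in step 8
    fields = ['review_id', 'review_body', 'star_rating']
    df = pd.read_csv(path_data/'amazon_reviews_multilingual_FR_v1_00.tsv', delimiter='\t', encoding='utf-8', usecols=fields)
    df = df[fields]  # put the columns in the order listed in fields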

That’s it. You can start fine-tuning your LM model and then fine-tuning your classifier with the French Amazon Customer Reviews by using the ULMFiT method implemented in the nn-vietnamese.ipynb notebook. Have fun and please, publish your results. Thanks :slight_smile:


I created a French SentencePiece LM on Wikipedia; the models are available here


[ EDIT 09/22/2019 ] I’ve finally trained a third French Bidirectional Language Model with the MultiFiT configuration. So, while everything written in this post is still relevant, I’m publishing a new post about this third model, which performs better than the 2 previous ones.

French Bidirectional Language Model (FBLM)

Happy to publish my French Bidirectional Language Model (FBLM), trained on a subset (about 100 million tokens) of the French Wikipedia.

  • notebooks/models parameters/vocab on github
  • post (in French) on medium

In fact, I published 2 FBLMs:

  • one with an AWD-LSTM architecture (fastai default: 3 layers, 1152 hidden units per layer) and the spaCy tokenizer (vocab of 60 000 tokens, min_freq of 2)
  • another one with a QRNN architecture (fastai default: 3 layers, 1152 hidden units per layer) and the SentencePiece tokenizer (vocab of 15 000 tokens)

To be noted
The FBLM trained with a QRNN architecture and the SentencePiece tokenizer performed better.

To be improved
I read the MultiFiT paper (sadly after training my 2 models…) and saw at the end of the paper a list of the hyperparameter values used for training, such as 4 QRNN layers (rather than 3 AWD-LSTM layers), 1550 hidden units per layer (rather than 1152), no dropout, a batch size of 50, etc.
This configuration should be tested in order to improve the performance of an FBLM.

ULMFiT on the “French Amazon Customer Reviews” (FACR) dataset

In order to test my 2 FBLMs, I fine-tuned them on the FACR dataset (see download guide) and then fine-tuned a Sentiment Classifier following the ULMFiT method.

Contrary to what I observed with the 2 general FBLMs, the fine-tuned Bidirectional French LM (lm-french.ipynb) and Sentiment Classifier (lm-french-classifier-amazon.ipynb) with an AWD-LSTM architecture and the spaCy tokenizer got better results (accuracy, perplexity and F1) than the Bidirectional French LM (lm2-french.ipynb) and Sentiment Classifier (lm2-french-classifier-amazon.ipynb) with a QRNN architecture and the SentencePiece tokenizer.

BUT, we found (sadly after training our models…) that 11 098 reviews in the supposed-to-be French dataset were not in French (almost 5% of the 230 684 reviews in our filtered dataset, which kept only negative (1 or 2 stars) and positive (4 or 5 stars) reviews).

We should delete these 11 098 reviews, then re-fine-tune our LM and, after that, our sentiment classifier on the French-only reviews dataset.
One more thing: the dataset is unbalanced (about 90% positive reviews against only 10% negative ones). A weighted loss was used to deal with this problem, but other techniques (oversampling, undersampling, etc.) should be tested too.
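
To make the filtering and the weighted loss a bit more concrete, here is a small sketch in plain Python. It assumes 3-star reviews are simply dropped and that the class weights are inverse class frequencies; the post does not say which weights were actually used, and `star_to_label` and `inverse_frequency_weights` are hypothetical helper names.

```python
from collections import Counter


def star_to_label(star_rating):
    """Map stars to a binary label; 3-star reviews are dropped (None)."""
    if star_rating in (1, 2):
        return "neg"
    if star_rating in (4, 5):
        return "pos"
    return None


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy ratings (not the real dataset): 9 positive, 1 negative, 2 neutral.
stars = [5, 4, 5, 5, 4, 5, 5, 4, 5, 1, 3, 3]
labels = [label for label in map(star_to_label, stars) if label is not None]
weights = inverse_frequency_weights(labels)
print(weights)
```

With a 90/10 split like the one described above, the rare negative class gets a much larger weight than the positive one, which is the effect a weighted loss relies on.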

The results on the validation set (I used 10% of the dataset and seed=42) of the Sentiment Classifier (lm-french-classifier-amazon.ipynb) with an AWD-LSTM architecture and the spaCy tokenizer are:

  • accuracy: (global) 95.97% | (neg) 92.95% | (pos) 96.35%
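
As an illustration of how such figures relate to each other, here is a short sketch that computes a global accuracy and one accuracy per true class. It assumes that (neg) and (pos) mean accuracy restricted to the reviews whose true label is negative or positive, which the post does not state explicitly; `accuracy_report` is a hypothetical helper.

```python
def accuracy_report(y_true, y_pred):
    """Return the global accuracy plus the accuracy within each true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    report = {"global": correct / len(y_true)}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        report[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return report


# Toy labels (not the real validation set): 2 negative, 8 positive reviews.
y_true = ["neg", "neg"] + ["pos"] * 8
y_pred = ["neg", "pos"] + ["pos"] * 7 + ["neg"]
report = accuracy_report(y_true, y_pred)
print(report)
```

On an unbalanced dataset the global figure is dominated by the majority class, so the per-class numbers are the ones to watch for the negative reviews.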

Final thoughts

  • The French Bidirectional Language Model should be retrained with the MultiFiT hyperparameters values.
  • Then, the 2 fine-tuned LM and Sentiment Classifier models should be retrained with the MultiFiT hyperparameters values and on a filtered French dataset (with only French reviews).
  • The filtered French dataset should be uploaded online in order to launch a competition on French reviews classification (if the Amazon License allows that).

(if someone wants to do that, I will be happy to help)


[ EDIT 10/20/2019 ] I’ve updated the notebook lm3-french-classifier-amazon.ipynb (nbviewer of the notebook) with the right code to reuse the SentencePiece model and vocab trained for the general LM in the specialized LM and the classifier (see explanations at the top of the notebook).

(MultiFiT) French Bidirectional Language Model (FBLM)

As noted in the edit to my previous post, I’ve finally trained a third French Bidirectional Language Model with the MultiFiT configuration. This LM performs better.

Architecture, training method and performance

You will find the training notebook lm3-french.ipynb and the link to download the model parameters and vocab in my Language Models GitHub repository.

The architecture used for this FBLM is 4 QRNN layers with 1550 hidden units per layer. The SentencePiece tokenizer (15 000 tokens) was used instead of the spaCy one.

Both the forward and backward models were trained for 10 epochs with a batch size of 50.

As the FBLM was trained on a big corpus of 100 million tokens (extracted from the French Wikipedia download of about 500 million tokens), I did not need much regularization. Therefore, I set drop_mult to zero (no dropout) and kept the default fastai weight decay of 0.01.

For the training, I used one NVIDIA V100 GPU on GCP.

         | accuracy | perplexity | training time
forward  | 39.68%   | 21.76      | 8h
backward | 43.67%   | 22.16      | 8h

PS: the training times given in the table are the sum of the fastai DataBunch creation time and the model training time over 10 epochs.
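
For readers who want to relate the perplexity column to a validation loss, here is a small sketch. It assumes perplexity is the exponential of the mean cross-entropy loss measured in nats, which is the usual convention but is not stated in the post; the function names are hypothetical.

```python
import math


def perplexity_from_loss(cross_entropy_loss):
    """Perplexity as the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_loss)


def loss_from_perplexity(perplexity):
    """Inverse mapping: the mean cross-entropy implied by a perplexity."""
    return math.log(perplexity)


# Perplexities taken from the table above.
for direction, ppl in {"forward": 21.76, "backward": 22.16}.items():
    print(direction, round(loss_from_perplexity(ppl), 2))
```

Under that assumption, the reported perplexities correspond to validation losses of roughly 3.08 (forward) and 3.10 (backward).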

Sentiment Classifier on the “French Amazon Customer Reviews”

Finally, I fine-tuned this French Bidirectional LM to train a Sentiment Classifier on the “French Amazon Customer Reviews” dataset (see notebook lm3-french-classifier-amazon.ipynb (nbviewer of the notebook)).

The performance of my MultiFiT SentencePiece (15 000 tokens) French Classifier is similar to that of the ULMFiT spaCy (60 000 tokens) one.

If you take into account the fact that training the FBLM with the MultiFiT configuration was 5 hours faster than with the ULMFiT one, and that the FBLM performance was better as well (see the comparison tables on this page), using the MultiFiT configuration looks like a good choice for fine-tuning a Sentiment Classifier from a Language Model.

Bonus: I love the function show_intrinsic_attention(), which lets you visualize the words that contributed the most to the classifier’s decision. An example below with a French product review :slight_smile: