I have been working on implementing ULMFiT for the French language using fastai v1.
I have created two datasets for this task:
- Language model: an extract of French Wikipedia (100M tokens) with a 30K vocab.
- Classification (sentiment analysis): movie reviews from an IMDb-like French website. The dataset contains 11K positive reviews and 11K negative reviews, as well as 51K unlabelled reviews for language model fine-tuning.
My results so far:
- Language model with 100M tokens and a 30K vocab: accuracy of 0.3570, perplexity of 24.36.
- Classification: accuracy of 0.9349 using the pretrained LM, fine-tuned on the 51K unlabelled reviews.
Without the pretrained LM, I still get an accuracy of 0.89 when the LM is trained only on the 51K unlabelled reviews.
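For anyone who wants to compare the perplexity above with a validation loss, here is a minimal sketch of the conversion. It assumes the perplexity is the exponential of the mean per-token cross-entropy in nats (the usual convention with PyTorch losses); the post does not say how the figure was computed, and the function names are just illustrative.

```python
import math

def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)

def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: the loss implied by a given perplexity."""
    return math.log(perplexity)

# Perplexity reported above for the French Wikipedia language model.
reported_perplexity = 24.36

# Assumption: this value was obtained as exp(validation loss).
implied_loss = loss_from_perplexity(reported_perplexity)
print(f"implied validation loss: {implied_loss:.4f}")
```

Under that assumption, a perplexity of 24.36 corresponds to a validation loss of roughly 3.19.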
There seems to be no public benchmark for the French language. I am working on a blog post to present the model.
I am also trying to convince a French movie review website to release a labelled dataset of movie reviews, in order to create a first benchmark.
The classifier code and the model weights are available on GitHub: https://github.com/tchambon/deepfrench