I have worked on implementing ULMFiT for the French language using fastai v1.
I have created two datasets for this task:
Language model: an extract of French Wikipedia (100M tokens) with a 30K vocab.
Classification (sentiment analysis): movie reviews from an IMDb-like French website. The dataset contains 11K positive reviews and 11K negative reviews, as well as 51K unlabelled reviews for language model fine-tuning.
My results so far:
Language model (100M tokens, 30K vocab): accuracy of 0.3570, perplexity of 24.36.
Classification: accuracy of 0.9349, using the pretrained LM and fine-tuning it on the 51K unlabelled reviews.
Without the pretrained LM, I still get 0.89 accuracy if I train the LM from scratch on the 51K unlabelled reviews.
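For reference, the end-to-end pipeline looks roughly like this with fastai v1 (a minimal sketch; the paths, file names, and pretrained-weight names are placeholders, not my actual ones):

```python
from fastai.text import *

# Hypothetical paths and file names, just to illustrate the pipeline.
path = Path('data/french_reviews')

# 1. Fine-tune the Wikipedia-pretrained French LM on the 51K unlabelled reviews.
data_lm = TextLMDataBunch.from_csv(path, 'unlabelled_reviews.csv')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3,
                                  pretrained_fnames=('fr_wiki_lm', 'fr_wiki_itos'))
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# 2. Train the sentiment classifier on the 22K labelled reviews,
#    reusing the fine-tuned encoder and the LM vocab.
data_clas = TextClasDataBunch.from_csv(path, 'labelled_reviews.csv',
                                       vocab=data_lm.train_ds.vocab)
learn_clas = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(1, 1e-2)
```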
It seems there is no public benchmark for the French language. I am working on a blog post to present the model.
I am trying to contact and convince a French movie review website to release a labelled dataset of movie reviews, to create a first benchmark.
You could also use this dataset: https://deft.limsi.fr/2017/
It’s not movie reviews but tweets, and there are four labels (objectif, positif, négatif, or mixte). For simplicity, you could just use the positive and negative samples and use their results as a baseline.
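For instance, keeping just the two polarity classes is a few lines of pandas (a sketch; the file and column names here are guesses, not the official DEFT 2017 layout):

```python
import pandas as pd

# Hypothetical file/column names for the DEFT 2017 tweet data.
df = pd.read_csv('deft2017_tweets.csv')

# Keep only the two polarity classes for a binary sentiment baseline.
binary = df[df['label'].isin(['positif', 'négatif'])].reset_index(drop=True)
binary.to_csv('deft2017_binary.csv', index=False)
```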
I’d be very interested in seeing the results; I’m trying to build a French model myself.
Tom, superb work! @jeremy have you seen this 93% accuracy?
It seems we have another superb result for ULMFiT. @tomsthom we should make it public on Twitter. Can you share your Twitter handle?
This is exactly what we are doing with Polish. It is going very slowly :). As a last resort, we are planning to simply publish a list of URLs and a way to fetch them yourself, for anyone who wants to verify the results.
I’ve been struggling to do the same thing for #HINDI using fastai v1. Any link to code for reference would be much appreciated. I have been able to build the language model; however, I’m facing issues in transferring it to text classification.
I will share the code on GitHub very soon; I need to do some cleaning first.
If you need it urgently, I can send you a working (but not cleaned) version as an example.
I sent you a link to the notebook by PM.
Once I have cleaned the code and updated it for the latest fastai v1 release, I will publish it on GitHub.
@piotr.czapla did you get an answer about the data of the DEFT competition?
Running ULMFiT on the 4-class tweet classification, I can easily get a macro F-score around 0.54 (it could be improved with more hyperparameter tuning).
The competition results I have seen (https://deft.limsi.fr/2017/actes_DEFT_2017.pdf#page=107) show a best macro F-score of 0.276. That would be a huge improvement over the SOTA!
But we have to confirm that we have the correct data (since it comes from an unofficial GitHub repo) and that this PDF shows the best official competition results.
I haven’t sent them a request, as @claeyzre found the data, so we can train and see. But indeed, it would be good to double-check that they are okay with us using their data.
You have superb results, if we haven’t made a mistake. F1 is tricky, as there are different, incompatible implementations of micro F1. For example, scikit-learn calculates F1 differently than it is described on the Wikipedia page, and what is worse, the results differ a lot.
For German, I calculated the F1 by hand, using the data from the paper to reverse-engineer the formula used in the competition, and then implemented the F1 calculation for my scripts using numpy.
I think we can do the same for the DEFT paper. 0.5 would be an amazing result.
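The by-hand version is short in numpy. Something like this (a sketch of the plain Wikipedia-style macro F1, not necessarily the competition’s exact formula; a class that is never predicted simply counts as 0):

```python
import numpy as np

def f1_macro_np(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 (Wikipedia-style macro F1)."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0)
    return float(np.mean(f1s))
```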
Please share the code. How about integrating it into ulmfit-multilingual?
OK, I will contact the competition organizers to try to get their approval and the official data/results.
For macro F1 (which seems to be the competition’s metric, not micro F1), I used two different implementations: one based on sklearn and a custom one I coded using the Wikipedia formulas. As you said, the results are not identical (I think this is because, when a class is never predicted, sklearn uses an F-score of 0 for it, which lowers the macro F1), but they are very close: 0.54 is the sklearn result, and my custom implementation gives a slightly higher score.
This is the sklearn-based implementation that gave me 0.54.
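The metric was along these lines (a minimal sketch, not the exact snippet; the argmax over the class dimension and the `(preds, targs)` tensor signature fastai v1 uses for metrics are assumptions here):

```python
import torch
from sklearn.metrics import f1_score

def f1_macro(preds, targs):
    """Macro F1 as a fastai v1 metric: argmax over classes, then sklearn."""
    pred_classes = preds.argmax(dim=-1).cpu().numpy()
    return torch.tensor(f1_score(targs.cpu().numpy(), pred_classes,
                                 average='macro'))
```

It can then be passed to the learner with `metrics=[accuracy, f1_macro]`.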
I should be able to share the full code next Monday.
Yes, it’s a good idea to integrate it into ulmfit-multilingual, as that is used a lot. There is already an fbeta metric in fastai, but it does not handle multiclass (with macro or micro averaging).