Hi @tomsthom! Looking forward to your reply. Thanks!
I sent you a link with the notebook by PM.
Once I have cleaned up the code and updated it for the latest fastai v1 changes, I will publish it on GitHub.
Thanks a ton @tomsthom ! Let me work this out for Hindi and i’ll post my results for it. Keeping fingers crossed.
@tomsthom I would love to see the code as well. Please let us know when it is available on GitHub
Nice work guys!
Would love to test your model or see the code if you are willing to share it.
I am currently trying to train a typo-correction tool based on SymSpell, and I am looking for a sound French corpus. Any suggestions?
I will publish the code this week. It will work with the latest changes of fastai v1.
For a French corpus, the simplest option is to start with the French Wikipedia (more information on how to extract it is in the LM zoo topic: Language Model Zoo 🦍).
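In the meantime, here is a minimal pure-Python sketch of the frequency-dictionary idea that SymSpell-style correctors build on. Note this is the naive edit-distance-1 candidate generation (Norvig-style), not SymSpell's faster delete-based index, and the toy corpus just stands in for a real cleaned Wikipedia dump:

```python
from collections import Counter
import string

def build_freq_dict(corpus: str) -> Counter:
    """Count word frequencies in a (cleaned) corpus."""
    return Counter(corpus.lower().split())

def edits1(word: str, alphabet: str = string.ascii_lowercase + "éèêàçùâîôû") -> set:
    """All strings one edit away from `word` (deletes, swaps, replaces, inserts)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [l + r[1:] for l, r in splits if r]
    swaps    = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts  = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + swaps + replaces + inserts)

def correct(word: str, freq: Counter) -> str:
    """Return the most frequent known candidate within one edit, else the word itself."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word

# Toy corpus standing in for a cleaned Wikipedia FR dump
freq = build_freq_dict("le chat mange le poisson et le chien mange aussi")
print(correct("mage", freq))  # "mange": the closest known word by corpus frequency
```

A real corpus is what makes this useful: the quality of the corrections depends almost entirely on the frequency counts, which is why a large, clean dump matters more than the lookup code.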
@piotr.czapla did you get an answer about the data of the DEFT competition?
Running ULMFiT on the 4-class tweet classification, I can easily get a macro F-score around 0.54 (it could be improved with more hyperparameter tuning).
The competition results I have seen (https://deft.limsi.fr/2017/actes_DEFT_2017.pdf#page=107) show a best macro F-score of 0.276, so this would be a huge improvement over the SOTA!
But we first have to confirm that we have the correct data (since it comes from an unofficial GitHub repo) and that this PDF shows the best official competition results.
I haven’t sent a request to them, as @claeyzre found the data, so we can train and see. But indeed, it would be good to double-check with them that they are okay with us using their data.
You have superb results, if we haven’t made a mistake. F1 is tricky, as there are different, incompatible implementations of micro F1. For example, scikit-learn calculates F1 differently from how it is described on the Wikipedia pages, and what is worse, the results differ a lot.
For German, I calculated the F1 by hand, using the data from the paper to reverse-engineer the formula used in the competition, and then I implemented the F1 calculation for my scripts using numpy.
I think we can do the same for the DEFT paper. 0.5 would be an amazing result.
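For reference, the by-hand macro F1 computation can be sketched like this in plain Python (the function name and the toy labels below are mine, not from the competition scripts):

```python
def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec  = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: per-class F1s are 2/3, 4/5 and 1, so macro F1 = 37/45 ≈ 0.822
print(macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```

Working through a small example by hand like this, against the published per-class numbers, is exactly what makes it possible to reverse-engineer which averaging convention a competition actually used.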
Please share the code. How about integrating into ulmfit-multilingual?
Ok, I will contact the organizers of the competition to try to get their approval and the official data/results.
For macro F1 (which seems to be the competition metric, not micro F1), I used two different implementations: one based on sklearn, and a custom one I coded using the Wikipedia formulas. As you said, the results are not identical (I think this is because, when a class is never predicted, sklearn uses an F-score of 0 for it, which lowers the macro F1), but they are very close: 0.54 is the sklearn result, and my custom implementation gives a slightly better score.
This is the sklearn based implementation that gave me 0.54:
from sklearn.metrics import f1_score
from torch import Tensor

def f1_sklearn(y_pred: Tensor, y_true: Tensor):
    # max(1) returns (values, indices); take the indices as the predicted labels
    y_pred = y_pred.max(1)[1]
    res = f1_score(y_true, y_pred, average='macro')
    return Tensor([res]).float().squeeze()
I should be able to share the full code next Monday.
Yes, it’s a good idea to integrate it into ulmfit-multilingual, as it’s used a lot. There is already an fbeta metric in fastai, but it does not handle multi-class (with macro or micro averaging).
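To make the sklearn-vs-custom gap concrete, here is a toy sketch of the two macro-averaging conventions (the variable names and example labels are made up): sklearn scores a never-predicted class as F1 = 0, while averaging only over the predicted classes gives a higher number.

```python
def per_class_f1(y_true, y_pred, c):
    """F1 for a single class c, computed from TP/FP/FN counts."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0

y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 1]  # class 2 is never predicted

f1s = [per_class_f1(y_true, y_pred, c) for c in (0, 1, 2)]
macro_with_zero = sum(f1s) / 3  # sklearn-style: the unpredicted class counts as 0
predicted = [c for c in (0, 1, 2) if c in y_pred]
macro_predicted_only = sum(per_class_f1(y_true, y_pred, c) for c in predicted) / len(predicted)

print(macro_with_zero)        # 0.5
print(macro_predicted_only)   # 0.75
```

On this toy example the two conventions differ by 0.25, which is why agreeing on the exact averaging rule matters before comparing against published competition scores.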
I have pushed the movie review classifier notebook to GitHub, as well as the weights/vocab of the French LM.
You can download it from here: https://github.com/tchambon/deepfrench
Still waiting for an answer from the DEFT competition organizers to confirm the very good results (new SOTA) on the 4-label tweet classification.
Thank you for sharing!
Could you give us more information about the IMDb-like French movie review dataset?
The website is called Allocine; I collected the data by web scraping.
I am trying to adapt the imdb notebook (fastai v0.7) to use the pretrained model,
but I get this error:
KeyError: ‘unexpected key “0.encoder_dp.emb.weight” in state_dict’
On the other hand, with fastai v1, I also get an error.
Thanks for any hint!
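That KeyError usually means the layer names in the checkpoint don't match the model the other fastai version builds. A generic sketch of the usual workaround, shown here on plain dicts so it stays self-contained (with real weights you would apply the same remapping to the `state_dict` before calling `model.load_state_dict`; the rename table below is purely illustrative, not the actual fastai mapping):

```python
def remap_state_dict(state_dict: dict, renames: dict) -> dict:
    """Return a copy of state_dict with old key prefixes replaced by new ones."""
    out = {}
    for key, value in state_dict.items():
        for old, new in renames.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out

# Illustrative only: the real rename table depends on the two fastai versions involved
renames = {"0.encoder_dp.": "0.encoder_with_dropout."}
old_sd = {"0.encoder_dp.emb.weight": "W", "1.decoder.weight": "V"}
new_sd = remap_state_dict(old_sd, renames)
print(sorted(new_sd))
```

The safe way to build the real rename table is to print the keys on both sides (the checkpoint's keys and `model.state_dict().keys()`) and match them up by hand.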
I took a look at your GitHub, and congratulations on a really great job.
Just a question: I am not sure what the itosref30.pkl file contains. It should only have data_lm.vocab.itos, where data_lm is the ‘general’ French language model, right?
Hi @tomsthom, I’m working on unlabeled French tweets, and as I’m new to NLP, I wonder how I can use your model to label them. If you can give me some advice, please send me your email. Thanks!
@tomsthom: could you share your French film review dataset, so that I can run my own tests and share the results? Thank you.
Hi, thanks to Rachel’s new fastai NLP course, the Jupyter notebook nn-vietnamese.ipynb is now online, which allows anybody to train a language model from scratch. I’m testing it to create a French LM (FLM), and I will publish a post about this.
So now, the question is how to use such an FLM (for example, for classification tasks in a specific domain = ULMFiT)… and how to benchmark the results!
For that, we need public datasets in French. We could clearly start with datasets for sentiment classification. Could we share links to them here (if any)?
I don’t remember if I got a response, but since then I’ve got access to MLDoc & CLS (https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which include FR as a language, and we’ve got pretty good results using both the LSTM and QRNN versions of ULMFiT. We have a paper & blog post coming out soon. If you want to give us a hand, you could retrain the QRNN and LSTM versions of the language model using the latest hyperparameters from fastai. The language models trained for the paper use a slightly buggy SentencePiece and old hyperparameters.
I’ve also rewritten all of our experimentation framework to use the current fastai. It is basically a high-level API on top of fastai. Here is an example showing how to train the JA version: https://github.com/n-waves/ulmfit-multilingual/blob/multifit/notebooks/JA-multifit_fp16.ipynb
Although running just a few commands would be helpful, it is going to be boring and not worth writing a blog post about.
What could be interesting is to investigate how much better our model gets if we pretrain the LM on a corpus closer to the domain than Wikipedia. It gave us a huge boost in performance for Polish hate speech detection: we pretrained the LM on Polish Reddit and fine-tuned it on the hate speech dataset, which consists of tweets.