ULMFiT - Portuguese

Portuguese Language Model: here is the place for collaboration on the creation and improvement of the model. @monilouise, @NandoBr and other folks from Brasília worked on this in parallel.
Best result for text classification:
Wiki perplexity: 3.46, classification accuracy: 97%
Dataset: 10,246 rows, 4 unbalanced classes (https://bit.ly/2Qohkhe)


@NandoBr have you done any research on Portuguese? What text classification datasets are available, and what is the SOTA?

Hi @piotr.czapla,

We and @NandoBr worked in parallel on a Portuguese ULMFiT LM. I posted some results on my blog: https://monilouise.wordpress.com/2018/08/14/general-domain-portuguese-language-model/. I believe the perplexity results may improve with further training epochs.
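
For anyone mapping these numbers to fastai's output: perplexity is just the exponential of the validation cross-entropy loss, so it can be read straight off the training log. A tiny check (the loss value below is illustrative):

import math

# Perplexity = exp(cross-entropy loss). A validation loss of about 1.24
# corresponds to the ~3.46 perplexity reported in this thread.
valid_loss = 1.24
print(f'perplexity = {math.exp(valid_loss):.2f}')  # perplexity = 3.46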

I then used my pretrained language model for classification on a Portuguese dataset (prepared by @NandoBr), and the best accuracy obtained was 97%. @NandoBr has also carried out similar experiments and may share his results. AFAIK, his language model also achieved excellent perplexity. Here’s the source code: https://github.com/fastai-bsb/nlp-tcu-enunciados/blob/master/1cycle_tcu.ipynb.

We both work in public administration here in Brazil (though at different government agencies). At the moment, I’m waiting to get access to documents from the Brazilian Court of Audit for further experiments with the same model, but in a more complex scenario. And yes, BERT is on the roadmap.


Awesome results! 97% is amazing. Can you tell us more about your dataset: which website, how many examples? Any plans to make this dataset public? Was it sentiment analysis?

@NandoBr can you put that info into the first comment so that it appears everywhere this thread is cited?

And can I get your Twitter handles?

Info added to the first comment.
My Twitter: @Fmelobr
Best

@jeremy it seems that we have another language model ready :). Monique and Fernando tested their LM on text classification. If I’m not mistaken, they classified documents into 4 classes: Personal, Bidding, Responsibility & Legal Process.

This work does not prove that we can beat SOTA in Portuguese, but it is super practical; 97% accuracy looks super useful.


@monilouise, @NandoBr do you have your model weights published somewhere?
Would you like to give them a try on some benchmark dataset so we can claim SOTA for Portuguese text classification :slight_smile:?

Very interesting. Any chance you could update your post with a little information on the classification task and results? I think that is an application a lot of folks would love to hear about.

Yes, we’d love to benchmark our model against SOTA.
What is necessary? Where can I find the SOTA record for Portuguese?


Yes! Sure, let’s do it!
What would be necessary to do the benchmark?

I believe we can get a better result if we use fastai v1. Today I started looking at the text example notebook. I had some issues adapting the notebook to train a Portuguese wiki LM from scratch and to fine-tune it with our dataset.

Another promising application I’m working on is a text classification model for the Federal Senate of Brazil: every week, dozens of requests are made to one of the four teams of consultants in the Legislative Consulting Department. These requests include studies, proposals for bills, production of speeches, among others. The idea is to automatically decide which team should handle a demand based on the text (usually 3 to 10 lines) that describes the legislative request. The preliminary result was 86% accuracy, which is not fantastic, but it is good enough, considering that human performance today is 89% accuracy.
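
For the curious, the classifier stage looks roughly like this in fastai v1. This is a sketch only: the path, file names, encoder name and hyperparameters below are placeholders, not the real Senate data.

from fastai.text import *

# Sketch of the classifier stage (fastai v1, autumn 2018 API).
path = Path('/fastai/data/senado')
data_clas = TextClasDataBunch.from_csv(path=path, tokenizer=Tokenizer(lang='pt'),
                                       train='train', valid='valid',
                                       max_vocab=60000, min_freq=2, bs=32)

learn = RNNLearner.classifier(data_clas, drop_mult=0.5)
learn.load_encoder('lm_pt_enc')  # encoder saved from the fine-tuned LM; the
                                 # classifier must reuse the LM's vocab so the
                                 # embeddings line up
learn.fit_one_cycle(4, 1e-2)     # 1-cycle on the classifier head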

Very excited to use what I learned in the fast.ai courses in a real-world situation.


Hi @NandoBr and @monilouise, congrats on the results!

I had the same problem: I pretrained a Portuguese model on Wikipedia, but suffered from the lack of resources and benchmarks in Portuguese to verify the results against other models.

Check this resources page from NILC; they may have a text classification dataset to use for benchmarks. Last time I checked, there was only one for sentiment analysis based on tweets.

You may also check the NLP labs at UFRGS or PUCRS.

@NandoBr, I started the pretraining for the same Wikipedia dump using v1, but I ended up having memory issues, which I believe other users were also hitting. Memory consumption keeps increasing every few epochs. I have a GTX 1070 with 8 GB, and had to reduce the batch size to 16 (with 64 it worked OK on the previous version), and even so I ended up with an out-of-memory error by epoch 10.
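
As a stopgap (a workaround sketch only, not a fix for the leak itself), one option is to fit a few epochs at a time and free cached GPU memory in between; here `learn` stands for the language-model learner from the pretraining code posted later in this thread, and the chunk sizes and names are illustrative:

import gc
import torch

# Workaround sketch: train in short chunks, checkpoint, and release cached
# GPU memory between chunks. This only postpones the OOM error; it does
# not fix the underlying leak.
for chunk in range(5):
    learn.fit(2, lr=1e-3, wd=1e-7)      # 2 epochs per chunk
    learn.save(f'lm_pt_chunk{chunk}')   # checkpoint in case the next chunk dies
    gc.collect()
    torch.cuda.empty_cache()            # return cached blocks to the driver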


@pvcastro, thanks for the valuable tips. I will check these sources and give feedback later.
I’m having the same memory issues. Let’s keep trying…

@piotr.czapla or @jeremy are you aware of these memory issues in v1?

Great work, thanks!

Great! We are getting ULMFiT updated to version 1 here: Multilingual ULMFiT. We will have a separate repo, and we will kill the memory issues.
I think that to compare with SOTA you can still use the old version, so we have the results and can re-run them on v1 once the issues are fixed.

Awesome application :slight_smile: thank you for sharing. I wish we had such proactive people working in federal offices in Poland!

Report this as an issue on the fastai GitHub and let’s try to fix it. I had a similar issue in image classification with long training times; if you’re comfortable coding, try to get the issue debugged.

Thank you for finding that baseline!

Hi @piotr.czapla,

My Twitter: @monilouise (the same as here)

@monilouise, @NandoBr do you have your model weights published somewhere?

I have it on Google Drive, but only the model pretrained on a subset of the Portuguese Wikipedia. I have the other weights on my Paperspace instance and hope to share them as soon as possible. Just to highlight some results, I trained the following variations:

  • 1-cycle in the pretraining (Wikipedia)
  • 1-cycle also in the second training (specific corpus)
  • 1-cycle both in the second training and in the classification phase: here, I got the best result (97%; see the sketch below)
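
In code, the three stages look roughly like this (fastai v1 API; the epoch counts and learning rates are illustrative, and `learn_lm` / `learn_clas` stand for the LM and classifier learners built as in the other snippets in this thread):

# Stage 1: 1-cycle pretraining on Wikipedia
learn_lm.fit_one_cycle(10, 1e-3)

# Stage 2: 1-cycle fine-tuning on the specific corpus
learn_lm.fit_one_cycle(4, 1e-3)
learn_lm.save_encoder('ft_enc')

# Stage 3: 1-cycle on the classifier, reusing the fine-tuned encoder
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(4, 1e-2)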

Also, I intend to train in Wikipedia for more epochs, but I don’t know if it will lead to significant improvements.

Let me also describe my application: at the Brazilian Court of Audit, we have a classification by theme, but currently this classification is done on small texts (about one paragraph); that dataset was used in the experiment reported here. The next challenge is to extend this classification to larger texts and turn it into multilabel classification. E.g., a document has several paragraphs, and each of these paragraphs can be classified under a specific theme, so a document may be classified under more than one theme. My team sees this task as a proof of concept for further applications (e.g., semantic search), and we are at the moment formally requesting the documents from another department. There are some confidentiality issues being addressed.
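
The usual change for the multilabel case (a generic sketch, not our final design) is to replace the softmax over themes with one sigmoid per theme and binary cross-entropy, so each theme is decided independently:

import torch
import torch.nn as nn

# Multilabel sketch: one sigmoid per theme instead of a softmax over themes,
# so a document (or paragraph) can belong to several themes at once.
n_themes = 4                                          # illustrative
logits = torch.randn(8, n_themes)                     # a batch of 8 document encodings
targets = torch.randint(0, 2, (8, n_themes)).float()  # several themes may be active

loss = nn.BCEWithLogitsLoss()(logits, targets)        # binary cross-entropy per theme
preds = torch.sigmoid(logits) > 0.5                   # independent threshold per theme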

I hope to have some results by December.


Yes, @jeremy.

I intend to publish updated results in new posts.

Hi @jeremy, it would be wonderful if somebody could show a fastai v1 notebook example of how to train a wiki LM in a language other than English, like lang='pt', and how to pass the parameters for LM fine-tuning and classification. This would be a great help for everybody interested in developing ULMFiT in other languages around the world.


I was doing the pretraining like this, from what I could gather from the v1 docs; I’m not sure it’s the best way to do it. I did end up hitting an out-of-memory error by the 10th epoch, as I mentioned earlier. Did you give it a try, @NandoBr?

from fastai.text import *  # fastai v1: brings in TextLMDataBunch, RNNLearner, Tokenizer, Path

def pretrain(path, language_model_id, epochs, batch_size, use_ids=True):
    # e.g. path = Path('/fastai/data/wiki/pt')

    if use_ids:
        # Files already tokenized and numericalized to ids
        data_lm = TextLMDataBunch.from_id_files(path=path, train='train', valid='valid',
                                                test='test', max_vocab=60000, min_freq=2,
                                                n_labels=0, chunksize=24000, bs=batch_size)
    else:
        # Raw files, tokenized on the fly with the Portuguese tokenizer
        data_lm = TextLMDataBunch.from_csv(path=path, tokenizer=Tokenizer(lang='pt'),
                                           train='train', valid='valid', test='test',
                                           max_vocab=60000, min_freq=2, n_labels=0,
                                           chunksize=24000, bs=batch_size)

    # AWD-LSTM language model with all dropouts scaled down by half
    learn = RNNLearner.language_model(data_lm, drop_mult=0.5)

    lr = 1e-3
    # Discriminative learning rates across layer groups, light weight decay
    learn.fit(epochs, lr=slice(lr / 2.6, lr), wd=1e-7)
    learn.save(language_model_id)
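
For reference, I call it like this (the path and model id below are placeholders from my setup):

pretrain(Path('/fastai/data/wiki/pt'), 'lm_pt', epochs=10, batch_size=16)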

Hi Pedro,
I haven’t had time to look into it further. I’m waiting for a tip on how to avoid the problem. If I were you, I would gather more info about the error and post it in the main ULMFiT thread. Maybe somebody can help us.
Good luck, and let’s keep in touch…