ULMFiT - Portuguese

Yes! Sure, let’s do it!
What would be necessary to do the benchmark?

I believe we can get a better result if we use fastai v1. Today I started looking at the text example notebook. I had some issues adapting the notebook to train the pt wiki LM from scratch and to fine-tune it with our dataset.

Another promising application I'm working on is a text classification model for the Federal Senate of Brazil: every week, dozens of requests are made to one of the four teams of consultants in the Legislative Consulting Department. These solicitations include requests for studies, proposals for bills, production of speeches, among others. The idea is to automatically decide which team should handle the demand based on the text (usually 3 to 10 lines) that describes the legislative request. The preliminary result was 86% accuracy, which is not fantastic, but it is good enough, considering that human performance today is 89% accuracy.
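In case it helps anyone, here is a rough sketch of what that classification step can look like in fastai v1. The file name, column names, the 'ft_enc' encoder name and data_lm are assumptions for illustration only (and function names changed a bit between fastai v1 releases), not the actual code I ran:

from fastai.text import *

path = Path('data/senate_requests')                 # hypothetical location of the request texts
df = pd.read_csv(path / 'requests.csv')             # assumed columns: 'text' (request), 'team' (target)
cut = int(len(df) * 0.8)
df_train, df_valid = df[:cut], df[cut:]

# Classification DataBunch, reusing the vocab of the fine-tuned language model (data_lm)
data_clas = TextClasDataBunch.from_df(path, train_df=df_train, valid_df=df_valid,
                                      text_cols='text', label_cols='team',
                                      vocab=data_lm.vocab, bs=32)

# AWD-LSTM classifier on top of the fine-tuned encoder, trained with gradual unfreezing
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')                        # encoder saved after LM fine-tuning
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))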

Very excited to use what I learned in the fast.ai courses in a real-world situation.


Hi @NandoBr and @monilouise, congrats on the results!

I had the same problem: I did the pretraining of a Portuguese model based on Wikipedia, but suffered from the lack of resources and benchmarks in Portuguese to compare the results against other models.

Check this resources page from NILC, they may have a dataset for text classification to test for benchmarks. Last time I checked there was only one for sentiment analysis based on tweets.

You may also look into the NLP labs at UFRGS or PUCRS.

@NandoBr, I started the pretraining on the same Wikipedia dump using v1, but ended up having memory issues, which I believe other users were also hitting. The memory consumption keeps increasing every few epochs. I have a GTX 1070 with 8 GB and had to reduce the batch size to 16 (with 64 it was working OK on the previous version), and even so I ended up with an out-of-memory error by epoch 10.


@pvcastro, thanks for the valuable tips. I will check these sources and give feedback later.
I'm having the same memory issues. Let's keep trying…

@piotr.czapla or @jeremy are you aware of these memory issues in v1?

Great work, thanks!

Great! We are getting ULMFiT updated to version 1 here: Multilingual ULMFiT. We will have a separate repo, and we will kill the memory issues.
I think you can still use the old version to compare with SOTA, so we have the results and can port them to v1 once the issues are fixed.

Awesome application :slight_smile: Thank you for sharing. I wish we had such proactive people working in federal offices in Poland!

Report this as an issue on the fastai GitHub and let's try to fix it. I had a similar issue with image classification during long training runs; if you are comfortable with coding, try to get the issue debugged.

Thank you for finding that baseline!

Hi @piotr.czapla,

My Twitter: @monilouise (the same as here)

@monilouise, @NandoBr do you have your model weights published somewhere?

I have it on Google Drive, but only the model pretrained on a subset of the Portuguese Wikipedia. I have the other weights on my Paperspace instance and hope to share them as soon as possible. Just to highlight some results, I trained the following variations (there is a rough sketch of the calls right after this list):

  • 1-cycle in the pretraining (Wikipedia)
  • 1-cycle also in the second training (specific corpus)
  • 1-cycle both in the second training and in the classification phase: here, I got the best result (97%)
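Just to make the phases concrete, the third variation (1-cycle everywhere) looks roughly like this; learn_lm / learn_clas stand for the language-model and classifier learners, and the learning rates and epoch counts here are placeholders, not the ones I actually used:

# Phase 1: pretraining the LM on Wikipedia with the 1-cycle policy
learn_lm.fit_one_cycle(10, 1e-3, wd=1e-7)

# Phase 2: fine-tuning the LM on the specific corpus, then keeping its encoder
learn_lm.fit_one_cycle(5, slice(1e-3 / 2.6, 1e-3))
learn_lm.save_encoder('ft_enc')

# Phase 3: training the classifier on top of the fine-tuned encoder
learn_clas.load_encoder('ft_enc')
learn_clas.fit_one_cycle(3, 2e-2)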

Also, I intend to train on Wikipedia for more epochs, but I don't know if it will lead to significant improvements.

Further, let me describe my application: at the Brazilian Court of Audit, we have a classification by theme, but currently this classification is done for small texts (about one paragraph). That dataset was used in the experiment reported here. The next challenge is to extend this classification to larger texts and treat it as a multilabel problem. E.g.: a document has several paragraphs, and each of these paragraphs can be classified under a specific theme, so a document may theoretically be assigned more than one theme. My team sees this task as a proof of concept for further applications (e.g. semantic search), and at the moment we are formally requesting the documents from another department. There are some confidentiality issues being addressed.
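Just to illustrate the multilabel setup (this is not actual code yet, since we are still requesting the documents; the file, column names and space-separated label format are assumptions, and the data block names differ slightly across fastai v1 releases): passing a label_delim makes each paragraph carry several themes, and fastai then switches to a sigmoid/BCE loss with thresholded metrics.

from fastai.text import *

path = Path('data/tcu')                                   # hypothetical
df = pd.read_csv(path / 'paragraphs.csv')                 # assumed columns: 'text', 'themes'

# 'themes' holds space-separated labels, e.g. 'licitacao pessoal' for a paragraph
# that touches two themes at once
data_clas = (TextList.from_df(df, path, cols='text')
             .split_by_rand_pct(0.2)
             .label_from_df(cols='themes', label_delim=' ')   # multilabel targets
             .databunch(bs=32))

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy_thresh])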

I hope to have some results by December.


Yes, @jeremy.

I intend to publish updated results in new posts.

Hi @jeremy, it would be wonderful if somebody could show a fastai v1 notebook example of how to train a wiki LM in a language other than English, like lang='pt', and how to pass the parameters for LM fine-tuning and classification. This would be a great help for everybody interested in developing ULMFiT in other languages around the world.


I was doing the pretraining like this, from what I could gather from the v1 docs. Not sure if it is the best way to do it. I did end up hitting an out-of-memory error by the 10th epoch, as I mentioned earlier. Did you give it a try, @NandoBr?

from fastai import *        # fastai v1 (API as of late 2018)
from fastai.text import *


def pretrain(path, language_model_id, epochs, batch_size, use_ids=True):
    # path = Path('/fastai/data/wiki/pt')

    if use_ids:
        # Wiki articles already tokenized and numericalized to id files on disk
        data_lm = TextLMDataBunch.from_id_files(path=path, train='train', valid='valid', test='test',
                                                max_vocab=60000, min_freq=2, n_labels=0,
                                                chunksize=24000, bs=batch_size)
    else:
        # Raw csv files, tokenized on the fly with the Portuguese spaCy tokenizer
        data_lm = TextLMDataBunch.from_csv(path=path, tokenizer=Tokenizer(lang='pt'), train='train',
                                           valid='valid', test='test', max_vocab=60000, min_freq=2,
                                           n_labels=0, chunksize=24000)

    # AWD-LSTM language model with all dropouts scaled by 0.5
    learn = RNNLearner.language_model(data_lm, drop_mult=0.5)

    lr = 1e-3

    # Discriminative learning rates across layer groups, small weight decay
    learn.fit(epochs=epochs, lr=slice(lr / 2.6, lr), wd=1e-7)
    learn.save(language_model_id)

Hi Pedro,
I haven't had time to spend more on it. I'm waiting for a tip on how to avoid the problem. If I were you, I would get more info about the error and post it in the main ULMFiT thread. Maybe somebody can help us.
Good luck, and let's keep in touch…

As discussed with @piotr.czapla, I'll open an issue on the fastai GitHub. I'm gathering the information needed for the issue, and will open it once I'm done.

For a quick fix, simply save the best model and restart the training by loading the weights.
To do so, use SaveModelCallback: https://docs.fast.ai/callbacks.tracker.html
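Something along these lines should do it (the model name, learner and epoch counts are placeholders):

from fastai.callbacks import SaveModelCallback

# Checkpoint the weights whenever the validation loss improves, so an OOM crash
# only costs the epochs since the last best checkpoint
learn.fit_one_cycle(10, 1e-3,
                    callbacks=[SaveModelCallback(learn, every='improvement',
                                                 monitor='valid_loss', name='best_lm')])

# After restarting the process: rebuild the learner, reload the best weights and resume
learn.load('best_lm')
learn.fit_one_cycle(5, 1e-3)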


OK, thanks!

Hi, we are trying to make a summary of the ULMFiT efforts, see: Multilingual ULMFiT
Do you have any results trained on a publicly available / official dataset?

Hi @piotr.czapla ,
Yes, I can help with that. I have a few details that I’d like to discuss with you. Can I have your e-mail?

piotr.czapla@gmail.com

Ok. Thanks

Wow, really cool to find you guys around here working with Portuguese datasets.

I would love to cooperate! :wink:

Hey guys!

What's the status on this? Do we have a Portuguese model already? If not, how can I help?

@piotr.czapla do you know how I can start contributing to this? Any place I can start?

Thanks, and congrats on the efforts!