ULMFit - Portuguese

(Pedro) #21

As discussed with @piotr.czapla, I’ll open an issue at the fastai github. I’m gathering the information needed for the issue, and will open it once I’m done.


(Piotr Czapla) #22

For a quick fix, simply save the best model and restart training by loading the weights.
To do so, use SaveModelCallback: https://docs.fast.ai/callbacks.tracker.html


(Pedro) #23

OK, thanks!


(Piotr Czapla) #24

Hi, we are trying to make a summary of ULMFiT efforts; see: Multilingual ULMFiT
Do you have any results trained on the publicly available / official dataset?


(Fernando Melo) #25

Hi @piotr.czapla ,
Yes, I can help with that. I have a few details that I’d like to discuss with you. Can I have your e-mail?


(Piotr Czapla) #26

piotr.czapla@gmail.com


(Fernando Melo) #27

Ok. Thanks


(Danilo Ribeiro) #28

Wow, really cool to find you guys around here working with Portuguese datasets.

I would love to cooperate! :wink:


(joao.br) #29

Hey guys!

What's the status on this? Do we have a Portuguese model already? If not, how can I help?

@piotr.czapla do you know how I can start contributing to this? Any place I can start?

Thanks and congrats for the efforts!


(Piotr Czapla) #30

Hi Joao. We do have a model, but I suspect it isn't as performant as it could be, since it was trained on Wikipedia, which uses formal language. You could try training a model on informal language, like that used on Reddit or Twitter. We had good results on Polish by pretraining on different text types. We could then test that model on some dataset and compare it to the one pretrained on Wikipedia.


(joao.br) #31

Hello Piotr. Where can I find the notebook used to train this model? Do you guys have a sample notebook on how to train a new model? Or even how you trained the English one, or the Polish model you cited? That would be awesome. I want to help but I don't know how, and I'd really rather not start from scratch…
Thanks, and congratulations on the work being done…


(Edmundo) #32

Using fast.ai version 51, just follow the template below to train a Portuguese language model:

from fastai.text import *
path = Path('data/wiki/pt/')
tokenizer = Tokenizer(lang='pt', n_cpus=4)
lm_data = TextLMDataBunch.from_csv(path, 'train.csv', tokenizer=tokenizer, label_cols=None, text_cols=0, chunksize=5000, max_vocab=60000, bs=64)
learn = language_model_learner(lm_data, AWD_LSTM, pretrained=False, drop_mult=0.)
learn.lr_find()
learn.recorder.plot(suggestion=True)
learn.fit_one_cycle(10, 2e-3, moms=(0.8,0.7))
learn.save('pt_language_model')
lm_data.vocab.save(path/'models/pt_itos.pkl')


(Patrick Blackman Sphaier) #33

Hi @NandoBr, I'm working on a 60K PT-wiki LM model, and came across your work while looking for benchmarks in Portuguese. I believe the classification results your team achieved on the TCU dataset are a great benchmark. Would you mind sharing some details, like the train-valid-test split ratios used and per-class metrics (f-score/accuracy)?
Best Regards


(Patrick Blackman Sphaier) #34

Never mind, I've just found the split ratios in the notebook you shared!


(Rodrigo Pinto Coelho) #35

I know a public school that is interested in building a chatbot to help high school children with their homework. I would love to help you guys see if we could get something like this done. I'm thinking the hardest part would be getting the model to recognize math symbols, but I would be curious to see what we could obtain if we specialized the language model with about 20000 questions and answers in math.
Can I help?
