A Code-First Introduction to Natural Language Processing 2019

pierreguillou · September 10, 2019, 8:51pm

On my side, I reduced by 5 the size of my databunch of the French wikipedia dataset from 5.435 Go to 1.077 Go in order to have a corpus of about 100 millions of tokens as adviced by Jeremy.

Note: if I understand well that using a databunch of 1 Go instead of 5 Go will speed up the ETT (Epoch Traing Time) of my Language Model, I think that I’m loosing a lot of language knowledge. Do you have an opinion on that?

On this base, I’m training today my French LM with the learner parameters values of the nn-vietnamese.ipynb notebook:

# bs = 128
# len(vocab.itos) = 60 000
learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, wd=0.01, pretrained=False).to_fp16()

After that, I will train it from scratch with the learner parameters values of the nn-turkish.ipynb notebook like you did (ie, drop_mult=0.1 & wd=0.1):

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.1, wd=0.1, pretrained=False).to_fp16()

I will published the results afterwards.