I would suggest ensembling the SentencePiece and word-level models (which, along with the forward and backward models, means you’ll be ensembling 4 models in total).
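Something along these lines is all I mean by ensembling here (a minimal sketch; the probability arrays are placeholders for whatever your four classifiers output):

import numpy as np

def ensemble_predict(prob_arrays):
    # prob_arrays: list of (n_docs, n_classes) softmax outputs, e.g. the
    # forward/backward SentencePiece and word-level classifiers (4 in total)
    avg = np.mean(np.stack(prob_arrays), axis=0)   # unweighted average of class probabilities
    return avg.argmax(axis=1), avg

# toy usage with random "predictions" from 4 models over 3 classes
rng = np.random.RandomState(0)
fake_probs = [rng.dirichlet(np.ones(3), size=5) for _ in range(4)]
preds, avg_probs = ensemble_predict(fake_probs)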
Hi @hafidz. Thanks for checking. I should have provided my updates earlier. Here’s the current progress:
- [DONE] Download and extract Malay Wikipedia corpus
- [DONE] Process text (clean and tokenize text)
- [DONE] Create validation set
- [DONE] Create data loader for training
- [DONE] Numericalize the text
- [DONE] Model setup
- [DONE] Train model
- [DONE] Evaluate language model
- [NOT DONE] Fine-tune language model for text classification task
- [NOT DONE] Build model for text classification
- [IN-PROGRESS] Find curated or publicly available labelled dataset for Malay corpus
- [NOT DONE] Create my own dataset by curating and labelling Malay text scraped from news sites
- [NOT DONE] Benchmark model for text classification
So, everything is done for language modelling. It took me a while as I was not satisfied with the model's performance (perplexity) during the first few iterations. Currently, I am hitting a roadblock with the text classification tasks. That aside, I think the Malay language model is ready to be contributed to the model zoo, so I will announce it shortly.
I hope your day is going well. I am happy to contribute the Malay language model to the model zoo.
ULMFiT in Malay language
The final validation loss was 3.38 (29.30 perplexity) and the accuracy was around 41% on the Malay Wikipedia corpus.
- Source code (shell script, Jupyter notebook)
- Pre-trained model weights
- Pre-processed training dataset of Malay Wikipedia
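As a quick sanity check on the numbers above: perplexity here is just the exponential of the validation cross-entropy loss, so the reported figures line up:

import math

print(math.exp(3.377671))   # ≈ 29.30, the reported perplexity at the final epoch
print(math.exp(3.38))       # ≈ 29.37, with the loss rounded to 2 d.p.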
em_sz = 400    # size of each embedding vector
nh = 1150      # number of hidden activations per layer
nl = 3         # number of layers
wd = 1e-7
bptt = 70
bs = 64
opt_fn = partial(optim.SGD, momentum=0.9)
weight_factor = 0.3
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * weight_factor
learner.clip = 0.2
lr = 8         # learning rate
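For completeness, the learner itself is built more or less like in the lesson 10 imdb notebook (fastai 0.7). This is a sketch from memory rather than my exact notebook code; trn_lm/val_lm are the numericalised token-id arrays from the preprocessing steps and vs is the vocabulary size:

from fastai.text import *   # fastai 0.7

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)   # trn_lm/val_lm: numericalised token ids
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)   # vs = vocab size, 1 = padding index

learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]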
learner.fit(lr, 1, wds=wd, cycle_len=10, use_clr=(10,33,0.95,0.85), best_save_name='best_lm_malay_1cycle')

Training loss history:

epoch  trn_loss  val_loss  accuracy
0      4.114716  3.936571  0.367859
1      3.83864   3.711561  0.382893
2      3.669321  3.603781  0.391633
3      3.63252   3.560706  0.394518
4      3.478959  3.513905  0.399009
5      3.518267  3.480469  0.401523
6      3.409158  3.465206  0.402808
7      3.426483  3.437133  0.405097
8      3.296175  3.409095  0.409595
9      3.185208  3.377671  0.413643
I tried to speed up training using Leslie Smith's work on the 1cycle policy, in which he describes the super-convergence phenomenon. The model was trained using the fastai library's implementation of this method, cyclical learning rates (CLR). Interestingly, based on my own experiments and observations with this method, the AWD-LSTM model converged faster: instead of 15 epochs, it took just 10.
It took me around 1 hour 24 minutes to train 1 epoch on one Tesla K80 GPU. The full training took me around 14 hours.
I think there’s room for further improvements. Next up, I plan to build a Singlish language model.
Great job @cedric.
@cedric, nice to see progress on the Malay language. Can you say how many words you have in your vocabulary?
We should check this model on downstream tasks like text classification. The issue with language modelling is that you can have superb perplexity if you have a small / too small vocabulary, and that perplexity doesn't necessarily translate into good performance on downstream tasks.
If you can't find any competition for Malay, try what we are doing for Polish:
Find any text corpus that you can classify, e.g.:
- Newspaper articles: business, politics, sport, fashion, etc.
- Sentiment on user comments (we are working with the Polish version of goodreads to obtain comments)
- Worst case, just classify whether something is from Wikipedia or from a newspaper
Since such a dataset would be new, you won't know the SOTA, but applying text classification with and without pretraining will give you a good baseline; it will show how much pretraining helps.
Besides, Google recently released Dataset Search; you may try to find Malay datasets there:
@cedric, @hafidz what would you say to creating a separate thread for Malay and discussing it there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German
Great idea. Would love to contribute if there are more tasks to be done. I've created a page for Malay for further discussion for anyone who's interested: ULMFiT - Malay
Hey, thank you for your reply and your efforts on organizing this thread. Nice job.
Thanks for the tips. I agree. I will work on the downstream tasks soon.
OK, I will take a look.
I found one or two small text corpora hidden in some academic papers published by Malaysia's local universities. I am still evaluating whether this corpus is suitable for building and training the model. The good thing is, there's already an existing benchmark, so I can compare my model against it.
Yes, I am aware of Google Dataset Search and have tried searching there, but found nothing.
Do any of you have tips on how to further reduce my training loss and improve accuracy?
I'm currently creating a language model based on the sentiment140 Twitter dataset. I have already tried varying vocabulary sizes (50k, 25k), Adam with a low lr, SGD with momentum and a high lr, different embedding sizes, hidden layer sizes, batch sizes, and different loss multiplication ratios, but no matter what I try I can't seem to get it below a value of 0.417, and that took 30 epochs, while I see you guys easily getting below 0.4 in just 2 epochs.
My dataset has these properties:
training set length: 1,440,000
validation set length: 160,000
unique words: 321,439
max vocab used: 50,000/25,000 (a min. freq. of 4 returns ~52k; see the vocab sketch after this list)
em_sz, nh, nl: 400, 1100, 3 (would smaller sizes be better for smaller datasets?)
opt_fn: optim.SGD, momentum=0.9
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15], dtype="f") * 0.5
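For reference, the vocab capping mentioned above is done roughly along these lines (a simplified sketch, not my exact preprocessing code):

from collections import Counter

def build_vocab(tokenized_texts, max_vocab=50000, min_freq=4):
    # count every token, then keep at most max_vocab tokens seen at least min_freq times
    freq = Counter(tok for text in tokenized_texts for tok in text)
    itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    itos.insert(0, '_pad_')
    itos.insert(0, '_unk_')   # special tokens go first
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# toy usage
itos, stoi = build_vocab([['i', 'love', 'this'], ['i', 'hate', 'that', 'i', 'love', 'this']],
                         max_vocab=10, min_freq=2)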
Following these, I decided to use an lr of around 18, but the accuracy seems to fall as the lr rises, so should I stick to something between 0 and 2.5 then?
The resulting training looks like this:

epoch  trn_loss  val_loss  accuracy
0      4.885424  4.765574  0.204343
1      4.68609   4.581247  0.219108
2      4.588282  4.500839  0.226138
3      4.54668   4.470822  0.227034
4      4.514667  4.445856  0.229765
5      4.476595  4.433705  0.23107
6      4.479217  4.425251  0.231592
7      4.452099  4.431449  0.230048
8      4.44206   4.419237  0.232063
9      4.436647  4.417188  0.232431
10     4.43317   4.412861  0.232667
11     4.422395  4.413309  0.232941
12     4.414105  4.402681  0.234613
13     4.425107  4.39716   0.234751
14     4.387628  4.395168  0.235595
15     4.402883  4.386707  0.235551
16     4.363533  4.378289  0.238221
17     4.357185  4.37697   0.237533
18     4.367101  4.368633  0.237971
19     4.313777  4.360797  0.240501
20     4.291882  4.358919  0.239816
21     4.281025  4.346954  0.242128
22     4.27367   4.337309  0.243213
23     4.240626  4.327436  0.244454
24     4.203354  4.322042  0.245484
25     4.24484   4.316995  0.245593
26     4.242165  4.313355  0.246129
27     4.175661  4.311628  0.246528
28     4.162489  4.308656  0.247344
29     4.17869   4.30674   0.247567
It seems to keep improving the longer it trains, but I can't let it train for too long, because I still need to use this computer for work, which I can't do while it's training…
Thanks for your post, it’s fascinating that your model is generating such coherent text!
I’m trying something similar but experiencing painfully slow training time using the AWD LSTM base model. I collected a huge set (5.5 billion tokens) of medical text, but quickly found that training on that much would literally take months. I culled the set down to 250 million tokens, but training is still 8.5 hr/epoch on an AWS p2 EC2 instance. I was curious how large your training corpus is.
I read this post by Jeremy (Language Model Zoo 🦍) that said 100 million tokens is the most our LMs should need, but I feel like a highly technical domain might necessitate a larger corpus.
Why did you pick that? That’s much higher than the charts suggest or we ever use in the course - you might want to re-watch that part of the lesson. Try an LR of 1.0 perhaps.
So I ran lr_find instead of lr_find2 this time. If I read the graph correctly and follow your advice from the lesson of taking the point with the highest learning rate where the loss is still strongly decreasing, then following this graph:
The choice would be around 10e1, right? But doesn't that seem absurdly high?
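For reference, the calls behind that graph are just these (fastai 0.7, with learner being the language model learner from above):

learner.lr_find()      # short sweep over increasing learning rates
learner.sched.plot()   # plot loss vs. learning rate (log scale) to pick the lr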
Just in case it’s useful to anyone here, I’ve uploaded here the notebook I used for the new pretrained model on wikitext-103 in fastai_v1. It’s not using the latest refactoring in fastai.text, but it can give you an idea of the hyper-parameters that were picked.
In particular, dropout can be really low since the corpus is so large (it can even be 0 on QRNNs).
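In code, that amounts to something like the following (a sketch only; the exact language_model_learner signature has changed between fastai v1 releases, so adapt it to your version):

from fastai.text import *   # fastai v1

# data_lm: a TextLMDataBunch built from your corpus, e.g.
# data_lm = TextLMDataBunch.from_csv(path, 'train.csv')

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.1, pretrained=False)  # low dropout, training from scratch
learn.fit_one_cycle(10, 1e-2, moms=(0.8, 0.7))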
Not sure why you’re seeing two separate sections in the graph, but I’d guess the first drop is best - i.e. 1e-2. You could try both and report back on your results.
1e-2 after one epoch:
epoch  trn_loss  val_loss  accuracy
0      6.783727  6.758687  0.045085
10e1 after one epoch:
epoch  trn_loss  val_loss  accuracy
0      4.715281  4.639458  0.223043
Sadly I can't compare over a full run, because it would cost me a whole day of not being able to work.
I’m trying to run another test with Adam instead of SGD, but when I use use_wd_sched=True instead of use_clr_beta I run out of memory after just 20 iterations.
So Adam with clr_beta returns this after one epoch:
epoch  trn_loss  val_loss  accuracy
0      4.694598  4.621761  0.22373
For Adam I got this plot:
And chose 5e-3, which is incidentally the same lr as in sgugger's updated notebook. I guess learn.true_wd is the new version of use_wd_sched?
Well, I don't think it would be a problem if you work at the word level rather than the Chinese-character level, because at the word level there would be no ambiguity. There are only a few cases with 1-to-N ambiguity (the N is very small, almost always 2), and a translator between traditional and simplified wouldn't fail on those few 1-to-N cases.
But consider the fact that word usage differs between mainland China, Taiwan and Hong Kong, i.e. they use different words to refer to the same thing. For example, 'software' in the mainland is '软件', but in Taiwan it's '软体'. So I think the problem is not character mapping but word usage. With this in mind, it's worth trying to train two models, or even three, based on region, because Hong Kong and Taiwan also differ from each other.
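If it helps anyone exploring the character-mapping side of this, OpenCC handles the traditional/simplified conversion (my own suggestion, not something mentioned above; package names vary, I'm assuming the opencc-python-reimplemented binding):

from opencc import OpenCC

t2s = OpenCC('t2s')   # traditional -> simplified
s2t = OpenCC('s2t')   # simplified -> traditional

print(t2s.convert('軟體'))   # -> '软体' (character mapping only; regional word usage still differs)
print(s2t.convert('软件'))   # -> '軟件'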
Not sure if anyone is still working on ULMFiT for Japanese at the moment, but I had written something when I was going through the course. I got it up to 90% accuracy. I assume it can be improved with some more tuning of the hyperparameters and data cleaning. Here's the notebook.
That's a good point. They'll still be much more similar than different - so I think a combined model followed by region-specific fine-tuning will likely work even better.
I have started to train an AWD LSTM model using v1 of fastai. While I was completely fascinated by the ease of use (it took, like, 5 lines of code to get started) and the flexibility of the framework, I have been running into technical problems. I mostly use default parameters, only tweaking Adam's betas and the learning rate; my corpus is 110 million tokens split 90/10 into train/validation. The first epoch goes mostly fine, though GPU memory utilization is around 99% from the start, but when I start another epoch, I get a CUDA OOM error. This prevents me from using cyclical learning rates. Sometimes I get OOM at the end of the first epoch. Cutting down on bptt leads to slower convergence (and probably a worse outcome).
Did anyone have this problem and find a solution? My setup is a deep learning image on GCP with a K80 (12 GB).
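A few things worth trying (suggestions only, not a confirmed fix for the fastai v1 OOM):

import gc
import torch

# free cached blocks between runs / before a second fit call
gc.collect()
torch.cuda.empty_cache()

# check what is actually allocated when the OOM happens
print(torch.cuda.memory_allocated() / 1024 ** 3, 'GB allocated')

# also consider trading batch size against bptt (e.g. bs=32 with bptt=70) rather than
# cutting bptt alone, since activation memory grows roughly with bs * bptt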
I was just looking into doing this for Turkish. Glad to have found this thread.