Hi @cedric. May I know how far you have gone with the Malay model?
I would suggest ensembling the SentencePiece and word-level models (which, along with the forward and backward models, means you'll be ensembling 4 models in total).
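A minimal sketch of what that ensembling could look like at prediction time, assuming each of the four classifiers has already produced class probabilities for the same test set (the file names and shapes here are hypothetical):

import numpy as np

# Hypothetical per-model class probabilities, each of shape (n_examples, n_classes)
probs_fwd_word = np.load('preds_fwd_word.npy')
probs_bwd_word = np.load('preds_bwd_word.npy')
probs_fwd_sp = np.load('preds_fwd_sp.npy')
probs_bwd_sp = np.load('preds_bwd_sp.npy')

# Unweighted average of the four models, then pick the most likely class
ensemble_probs = (probs_fwd_word + probs_bwd_word + probs_fwd_sp + probs_bwd_sp) / 4
ensemble_pred = ensemble_probs.argmax(axis=1)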
Hi @hafidz. Thanks for checking. I should have provided my updates earlier. Here's the current progress:
- [DONE] Download and extract Malay Wikipedia corpus
- [DONE] Process text (clean and tokenize text)
- [DONE] Create validation set
- [DONE] Create data loader for training
- [DONE] Numericalize the text
- [DONE] Model setup
- [DONE] Train model
- [DONE] Evaluate language model
- [NOT DONE] Fine-tune language model for text classification task
- [NOT DONE] Build model for text classification
- [IN-PROGRESS] Find curated or publicly available labelled dataset for Malay corpus
- [NOT DONE] Create my own dataset by curating and labelling Malay text scraped from news sites
- [NOT DONE] Benchmark model for text classification
So, everything is done for language modelling. It took me a while, as I was not satisfied with the model's performance (perplexity) during the first few iterations. Currently, I am hitting a roadblock on the text classification tasks. That aside, I think the Malay language model is ready to be contributed to the model zoo, so I will announce this shortly.
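For anyone curious what the middle steps of that checklist look like in code, here is a rough sketch in the fastai 0.7 (lesson-10) style I followed; the variable texts and the vocabulary caps are hypothetical:

import collections
import numpy as np
from fastai.text import Tokenizer, partition_by_cores

# texts: a list of cleaned Malay Wikipedia articles (hypothetical)
tok = Tokenizer().proc_all_mp(partition_by_cores(texts))

# Build the vocabulary from token frequencies, capped at max_vocab tokens
freq = collections.Counter(t for doc in tok for t in doc)
max_vocab, min_freq = 60000, 2
itos = [w for w, c in freq.most_common(max_vocab) if c > min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')
stoi = collections.defaultdict(lambda: 0, {v: k for k, v in enumerate(itos)})

# Numericalize: map every token to its vocabulary index
trn_lm = np.array([[stoi[t] for t in doc] for doc in tok])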
Hey folks,
I hope your day is going well. I am happy to contribute a Malay language model to the model zoo.
ULMFiT in Malay language
The final validation loss was 3.38 (29.30 perplexity) and the accuracy was around 41% on Malay Wikipedia corpus.
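(Perplexity here is just the exponential of the validation cross-entropy loss, e.g.:)

import math
math.exp(3.38)  # ~29.4; the reported 29.30 presumably comes from the unrounded loss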
- Source code (shell script, Jupyter notebook)
- Pre-trained model weights
- Pre-processed training dataset of Malay Wikipedia
Hyper-parameters:
em_sz = 400   # size of each embedding vector
nh = 1150     # number of hidden activations per layer
nl = 3        # number of layers
wd = 1e-7     # weight decay
bptt = 70     # back-propagation-through-time sequence length
bs = 64       # batch size
opt_fn = partial(optim.SGD, momentum=0.9)   # optimizer: SGD with momentum
weight_factor = 0.3                         # global scaling factor for the dropouts
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * weight_factor  # AWD-LSTM dropouts
learner.clip = 0.2   # gradient clipping
lr = 8               # learning rate
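For context, here is roughly how these settings are wired into the learner in the fastai 0.7 / lesson-10 style I used; trn_lm, val_lm, vs (vocabulary size) and PATH are assumed to come from the data-preparation steps, so treat this as a sketch rather than the exact script:

from fastai.text import *  # fastai 0.7, as in the lesson-10 notebook

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)  # vs = len(itos)
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]
learner.clip = 0.2  # gradient clipping, as above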
Training:
learner.fit(lr, 1, wds=wd, cycle_len=10, use_clr=(10,33,0.95,0.85), best_save_name='best_lm_malay_1cycle')
# Training loss history
epoch trn_loss val_loss accuracy
0 4.114716 3.936571 0.367859
1 3.83864 3.711561 0.382893
2 3.669321 3.603781 0.391633
3 3.63252 3.560706 0.394518
4 3.478959 3.513905 0.399009
5 3.518267 3.480469 0.401523
6 3.409158 3.465206 0.402808
7 3.426483 3.437133 0.405097
8 3.296175 3.409095 0.409595
9 3.185208 3.377671 0.413643
I have tried to speed up training using Leslie Smith's work on the 1cycle policy, in which he described the super-convergence phenomenon. The model was trained using the fastai library's implementation of this method, cyclical learning rates (CLR). Interestingly, based on my own experiments and observations with this method, the AWD-LSTM model converged faster: it took just 10 epochs instead of 15.
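For readers not on fastai 0.7, roughly the same idea (the learning rate ramps up and then anneals while momentum moves in the opposite direction) is available in plain PyTorch as torch.optim.lr_scheduler.OneCycleLR. This is only a loose analogue of the use_clr tuple above, not an exact reproduction of it; model and trn_dl are hypothetical stand-ins:

import torch
from torch.optim.lr_scheduler import OneCycleLR

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.8, momentum=0.9)

# The scheduler drives the LR up to max_lr and back down over the whole run,
# while cycling momentum between max_momentum and base_momentum.
scheduler = OneCycleLR(optimizer, max_lr=8.0, epochs=10, steps_per_epoch=len(trn_dl),
                       base_momentum=0.85, max_momentum=0.95)

for epoch in range(10):
    for xb, yb in trn_dl:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # placeholder forward/loss computation
        loss.backward()
        optimizer.step()
        scheduler.step()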
It took me around 1 hour 24 minutes to train 1 epoch on one Tesla K80 GPU. The full training took me around 14 hours.
I think there's room for further improvements. Next up, I plan to build a Singlish language model.
@cedric, nice to see progress on the Malay language. How many words do you have in your vocabulary?
We should check this model on downstream tasks, like text classification. The issue with language modelling is that you can have superb perplexity if you have a small / too small vocabulary, and that perplexity doesn't necessarily translate into good performance on downstream tasks.
If you can find any competition for Malay, try what we are doing for Polish:
- Find any text corpus that you can classify, e.g.:
  - Newspaper articles: business, politics, sport, fashion, etc.
  - Sentiment on user comments (we are working with the Polish version of Goodreads to obtain comments)
  - Worst case, just classify whether something is from Wikipedia or from a newspaper
- Since such a dataset would be new you won't know the SOTA, but applying text classification without pretraining and with pretraining will give you a good baseline; it will show how pretraining helps (a rough sketch of this comparison is below).
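A rough sketch of that with/without-pretraining comparison, written against the fastai v1 API for brevity; data_clas and the encoder name 'malay_lm_enc' are hypothetical, and the fine-tuned LM encoder is assumed to have been saved already:

from fastai.text import text_classifier_learner, AWD_LSTM

# Baseline: classifier trained from scratch, no pretrained language model
learn_scratch = text_classifier_learner(data_clas, AWD_LSTM, pretrained=False)
learn_scratch.fit_one_cycle(4)

# Same classifier, but initialised from the fine-tuned LM encoder
learn_pretrained = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_pretrained.load_encoder('malay_lm_enc')
learn_pretrained.fit_one_cycle(4)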
Besides, Google recently released Dataset Search; you may try to find Malay datasets there:
https://toolbox.google.com/datasetsearch
@cedric, @hafidz, what would you say to creating a separate thread for Malay and discussing this there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German
Great idea. I would love to contribute if there are more tasks to be done. I've created a page for Malay for further discussion, for anyone who's interested: ULMFiT - Malay
Hey, thank you for your reply and your efforts on organizing this thread. Nice job.
60,000.
Thanks for the tips. I agree. I will work on the downstream tasks soon.
OK, I will take a look.
I found one or two small text corpora hidden in some academic papers published by Malaysia's local universities. I am still evaluating whether these corpora are suitable for building and training the model. The good thing is, there's already an existing benchmark, so I can compare my model against it.
Yes, I am aware of Google Dataset Search; I have tried searching there and found nothing.
Do any of you have tips on how to further reduce my training loss and improve accuracy?
I'm currently creating a language model based on the sentiment140 Twitter dataset. I have already tried varying vocabulary sizes (50k, 25k), Adam with a low LR, SGD with momentum and a high LR, different embedding sizes, hidden layer sizes, batch sizes, and different loss multiplication ratios, but no matter what I try I can't seem to get it below a value of 0.417, and even that took 30 epochs, while I see you guys easily getting below 0.4 in just 2 epochs.
My dataset has these properties:
training set length: 1,440,000
validation length: 160,000
unique words: 321,439
max vocab used: 50,000/25,000 (min. freq. of 4 returns ~52k)
len(np.concatenate(trn_lm)): 22,498,795
settings:
chunksize: 50,000
em_sz,nh,nl: 400,1100,3 (Would smaller sizes be better for smaller datasets?)
bptt: 70
bs: 50
opt_fn: optim.SGD, momentum=0.9
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15], dtype='f') * 0.5
use_clr_beta=(10,20,0.95,0.85)
Following these, I decided to use an LR of around 18, but the accuracy seems to fall with a rising LR, so should I stick to something between 0 and 2.5 instead?
The resulting training looks like this:
epoch trn_loss val_loss accuracy
0 4.885424 4.765574 0.204343
1 4.68609 4.581247 0.219108
2 4.588282 4.500839 0.226138
3 4.54668 4.470822 0.227034
4 4.514667 4.445856 0.229765
5 4.476595 4.433705 0.23107
6 4.479217 4.425251 0.231592
7 4.452099 4.431449 0.230048
8 4.44206 4.419237 0.232063
9 4.436647 4.417188 0.232431
10 4.43317 4.412861 0.232667
11 4.422395 4.413309 0.232941
12 4.414105 4.402681 0.234613
13 4.425107 4.39716 0.234751
14 4.387628 4.395168 0.235595
15 4.402883 4.386707 0.235551
16 4.363533 4.378289 0.238221
17 4.357185 4.37697 0.237533
18 4.367101 4.368633 0.237971
19 4.313777 4.360797 0.240501
20 4.291882 4.358919 0.239816
21 4.281025 4.346954 0.242128
22 4.27367 4.337309 0.243213
23 4.240626 4.327436 0.244454
24 4.203354 4.322042 0.245484
25 4.24484 4.316995 0.245593
26 4.242165 4.313355 0.246129
27 4.175661 4.311628 0.246528
28 4.162489 4.308656 0.247344
29 4.17869 4.30674 0.247567
It seems to keep improving the longer I train, but I can't let it train for too long, because I still need to use this computer for work, which I can't do while it's training.
Hi Christine,
Thanks for your post; it's fascinating that your model is generating such coherent text!
I'm trying something similar but experiencing painfully slow training using the AWD-LSTM base model. I collected a huge set (5.5 billion tokens) of medical text, but quickly found that training on that much would literally take months. I culled the set down to 250 million tokens, but training still takes 8.5 hours/epoch on an AWS p2 EC2 instance. I was curious how large your training corpus is.
I read the post by Jeremy (Language Model Zoo) that said 100 million tokens is the most our LMs should need, but I feel like a highly technical domain might necessitate a larger corpus.
Thanks!
-Bill
Why did you pick that? That's much higher than the charts suggest or we ever use in the course - you might want to re-watch that part of the lesson. Try an LR of 1.0 perhaps.
So I ran lr_find instead of lr_find2 this time. If I read the graph correctly and follow your advice from the lesson of taking the point with the highest learning rate where the loss is still strongly decreasing, then following this graph the choice would be around 10e1, right? But doesn't that seem absurdly high?
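(For reference, the fastai 0.7 LR-finder calls being discussed are just the standard usage, nothing specific to this dataset:)

learner.lr_find()     # sweep the learning rate over a short mock training run
learner.sched.plot()  # plot loss vs. learning rate to pick the LR from the curve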
Just in case it's useful to anyone here, I've uploaded the notebook I used for the new pretrained model on wikitext-103 in fastai_v1. It doesn't use the latest refactoring in fastai.text, but it can give you an idea of the hyper-parameters that were picked.
In particular, dropout can be really low since the corpus is so large (it can even be 0 on QRNNs).
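In the v1 API the overall dropout level is controlled by a single multiplier, so "really low dropout" would look something like the sketch below, assuming a later fastai v1 release where language_model_learner takes an architecture argument; data_lm is hypothetical:

from fastai.text import language_model_learner, AWD_LSTM

# drop_mult scales all of the AWD-LSTM dropouts at once; with a very large
# corpus it can be set very low (per the note above, even 0 for QRNNs)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.1, pretrained=False)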
Not sure why you're seeing two separate sections in the graph, but I'd guess the first drop is best - i.e. 1e-2. You could try both and report back on your results.
1e-2 after one epoch:
epoch trn_loss val_loss accuracy
0 6.783727 6.758687 0.045085
10e1 after one epoch:
epoch trn_loss val_loss accuracy
0 4.715281 4.639458 0.223043
Sadly I can't compare on a full run, because it'll take me a whole day of not being able to work.
I'm trying to run another test with Adam instead of SGD, but when I use use_wd_sched=True instead of use_clr_beta I run out of memory after just 20 iterations.
So Adam with clr_beta returns this after one epoch:
epoch trn_loss val_loss accuracy
0 4.694598 4.621761 0.22373
For Adam I got this plot:
I chose 5e-3, which is incidentally the same LR as in sgugger's updated notebook. I guess learn.true_wd is the new version of use_wd_sched?
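(For what it's worth, true_wd in fastai v1 enables decoupled, AdamW-style weight decay applied directly to the weights, which is the same idea use_wd_sched pointed at in 0.7. A minimal sketch of toggling it, assuming learn was built elsewhere:)

learn.true_wd = True                   # decoupled (AdamW-style) weight decay; the v1 default
learn.fit_one_cycle(1, 5e-3, wd=1e-7)  # weight decay passed per fit call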
Well, I don't think it would be a problem if you work at the word level rather than the Chinese-character level, because at the word level there would be no ambiguity. There are only a few cases with 1-to-N ambiguity (N is very small, almost always 2), and a converter between traditional and simplified wouldn't fail on those few 1-to-N cases.
But consider the fact that word usage differs between the mainland, Taiwan and Hong Kong: they use different words to refer to the same thing. For example, "software" on the mainland is 软件, but in Taiwan it's 軟體. So I think the problem is not character mapping but word usage. With this in mind, it's worth trying to train two, or even three, models based on region, because Hong Kong and Taiwan also differ.
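A small illustration of that character-mapping vs. word-usage distinction, using the OpenCC converter; this assumes the opencc Python package (e.g. opencc-python-reimplemented), and 's2t'/'s2twp' are OpenCC's standard config names:

from opencc import OpenCC

print(OpenCC('s2t').convert('软件'))    # character-level conversion only: 軟件
print(OpenCC('s2twp').convert('软件'))  # Taiwan standard with phrase substitution: 軟體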
Quite erudite!
Not sure if anyone is still working on ULMFiT for Japanese at the moment, but I wrote something when I was going through the course. I got it up to 90% accuracy. I assume it can be improved with some more tuning of the hyperparameters and data cleaning. Here's the notebook.
That's a good point. They'll still be much more similar than different, so I think a combined model followed by region-specific fine-tuning will likely work even better.
Hi everyone!
I have started to train an AWD-LSTM model using v1 of fastai. While I was completely fascinated by the ease of use (it took, like, 5 lines of code to get started) and the flexibility of the framework, I have been running into technical problems. I mostly use default parameters, only tweaking Adam's betas and the learning rate; my corpus is 110 million tokens, split 90/10 into train/validation. The first epoch goes mostly fine, though GPU memory utilization is around 99% from the start, but when I start another epoch, I get a CUDA OOM error. This prevents me from using cyclical learning rates. Sometimes I get OOM at the end of the first epoch. Cutting down on bptt leads to slower convergence (and probably a worse outcome).
Did anyone have this problem and find a solution? My setup is a deep learning image on GCP with a K80 (12 GB).
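Not a guaranteed fix, but the workarounds usually suggested for this are lowering bs/bptt when building the data, and releasing cached CUDA memory between runs, e.g.:

import gc
import torch

gc.collect()              # drop Python references that may still pin GPU tensors
torch.cuda.empty_cache()  # return cached, unused blocks to the CUDA driver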