Language Model Zoo 🦍

I would suggest ensembling the SentencePiece and word-level models (which, along with the forwards and backwards models, means you’ll be ensembling 4 models in total).
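A minimal sketch of what such a four-model ensemble could look like, assuming each classifier outputs a probability vector over the same classes and that the ensemble is a plain unweighted average (other weightings are possible). The model names and probabilities are made up for illustration.

```python
def ensemble_average(prob_lists):
    """Average per-class probabilities coming from several models."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

# Hypothetical predictions for one document and two classes.
preds = {
    "word_forward": [0.7, 0.3],
    "word_backward": [0.6, 0.4],
    "sentencepiece_forward": [0.8, 0.2],
    "sentencepiece_backward": [0.5, 0.5],
}

avg = ensemble_average(list(preds.values()))
# Index of the class with the highest averaged probability.
predicted_class = max(range(len(avg)), key=avg.__getitem__)
print(avg, predicted_class)
```

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one.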


Hi @hafidz. Thanks for checking. I should have provided my updates earlier. Here’s the current progress:

  • [DONE] Download and extract Malay Wikipedia corpus
  • [DONE] Process text (clean and tokenize text)
  • [DONE] Create validation set
  • [DONE] Create data loader for training
  • [DONE] Numericalize the text
  • [DONE] Model setup
  • [DONE] Train model
  • [DONE] Evaluate language model
  • [NOT DONE] Fine-tune language model for text classification task
  • [NOT DONE] Build model for text classification
  • [IN-PROGRESS] Find curated or publicly available labelled dataset for Malay corpus
  • [NOT DONE] Create my own dataset by curating and labelling Malay text scraped from news sites
  • [NOT DONE] Benchmark model for text classification

So, everything is done for language modelling. It took me a while, as I was not satisfied with the model’s performance (perplexity) during the first few iterations. Currently, I am hitting a roadblock with the text classification task. That aside, I think the Malay language model is ready to be contributed to the model zoo, so I will announce this shortly.

Hey folks,

I hope your day is going well. I am happy to contribute a Malay language model to the model zoo.

ULMFiT in Malay language

The final validation loss was 3.38 (29.30 perplexity) and the accuracy was around 41% on Malay Wikipedia corpus.
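As a quick check of how the two numbers relate: perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats (natural log), which is what PyTorch's cross-entropy reports. The loss value below is the final val_loss from the training log in this post.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity for a mean cross-entropy loss given in nats."""
    return math.exp(cross_entropy_loss)

final_val_loss = 3.377671  # final val_loss reported in this post
ppl = perplexity(final_val_loss)
print(f"{ppl:.2f}")  # prints 29.30
```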


em_sz = 400  # size of each embedding vector
nh = 1150    # number of hidden activations per layer
nl = 3       # number of layers
wd = 1e-7
bptt = 70
bs = 64
opt_fn = partial(optim.SGD, momentum=0.9)
weight_factor = 0.3

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * weight_factor

learner.clip = 0.2

lr = 8 # learning rate

# Training
learner.fit(lr, 1, wds=wd, cycle_len=10, use_clr=(10,33,0.95,0.85), best_save_name='best_lm_malay_1cycle')

# Training loss history
epoch      trn_loss   val_loss   accuracy                       
    0      4.114716   3.936571   0.367859  
    1      3.83864    3.711561   0.382893                       
    2      3.669321   3.603781   0.391633                       
    3      3.63252    3.560706   0.394518                       
    4      3.478959   3.513905   0.399009                       
    5      3.518267   3.480469   0.401523                       
    6      3.409158   3.465206   0.402808                       
    7      3.426483   3.437133   0.405097                       
    8      3.296175   3.409095   0.409595                       
    9      3.185208   3.377671   0.413643

I tried to speed up training using the 1cycle policy from Leslie Smith’s work, in which he describes the super-convergence phenomenon. The model was trained using the fastai library’s implementation of this method, Cyclical Learning Rate (CLR). Interestingly, based on my own experiments and observations with this method, the AWD-LSTM model converged faster: instead of 15 epochs, it took just 10 epochs.
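To give a rough idea of the shape of such a schedule, here is a simplified sketch of a single triangular learning-rate cycle. It is not fastai's actual CLR code: the function name, `div`, and `warm_frac` are illustrative assumptions, and only `lr_max = 8` comes from the post.

```python
def triangular_lr(n_iter, lr_max, div=10.0, warm_frac=0.3):
    """One cycle: ramp linearly from lr_max/div up to lr_max, then back down."""
    lr_min = lr_max / div
    peak = max(1, int(n_iter * warm_frac))  # iteration where lr_max is reached
    lrs = []
    for i in range(n_iter):
        if i < peak:
            t = i / peak                                     # warm-up phase
        else:
            t = 1.0 - (i - peak) / max(1, n_iter - peak)     # decay phase
        lrs.append(lr_min + t * (lr_max - lr_min))
    return lrs

schedule = triangular_lr(n_iter=10, lr_max=8.0)
print([round(lr, 2) for lr in schedule])
```

The high peak rate is only held briefly, which is part of why rates that look very large on paper can still train stably with this kind of schedule.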

It took me around 1 hour 24 minutes to train 1 epoch on one Tesla K80 GPU. The full training took me around 14 hours.

I think there’s room for further improvements. Next up, I plan to build a Singlish language model. :grin:


Great job @cedric.


@cedric, nice to see progress on the Malay language. How many words do you have in your vocabulary?

We should check this model on downstream tasks, like text classification. The issue with language modelling is that you can get superb perplexity if your vocabulary is small or too small, and that perplexity doesn’t necessarily translate to good performance on downstream tasks.
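One way to see why a small vocabulary flatters perplexity: the smaller the vocabulary, the more tokens collapse into a single unknown token, which is easy to predict. A minimal sketch, assuming a frequency-based vocabulary cut-off; the helper name and the toy corpus are hypothetical.

```python
from collections import Counter

def unk_rate(tokens, vocab_size):
    """Fraction of tokens that fall outside the `vocab_size` most frequent words."""
    counts = Counter(tokens)
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    n_unk = sum(1 for tok in tokens if tok not in vocab)
    return n_unk / len(tokens)

# Toy corpus: a x4, b x3, c x2, d x1
tokens = "a a a a b b b c c d".split()
for size in (2, 3, 4):
    print(size, unk_rate(tokens, size))
```

Reporting the unknown-token rate alongside perplexity makes models with different vocabulary sizes easier to compare.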

If you can find any competition for Malay, try what we are doing for Polish:

  • Find any text corpus that you can classify: eg.

    • Newspaper articles: Business, Politics, Sport, Fashion, etc.
    • Sentiment on user comments (we are working with the Polish version of Goodreads to obtain comments)
    • Worst case, just classify whether something is from Wikipedia or from a newspaper
  • Since such a data set would be new, you won’t know the SOTA, but applying text classification without pretraining and with pretraining will give you a good baseline and show how much pretraining helps.

Besides, Google recently released Dataset Search; you may try to find Malay data there.

@cedric, @hafidz, what would you say to creating a separate thread for Malay and discussing this there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German


Great idea. Would love to contribute if there are more tasks to be done. I’ve created a page for Malay for further discussion, for anyone who’s interested: ULMFiT - Malay


Hey, thank you for your reply and your efforts on organizing this thread. Nice job.

60,000.

Thanks for the tips. I agree. I will work on the downstream tasks soon.

OK, I will take a look.

I found one or two small text corpora hidden in some academic papers published by Malaysia’s local universities. I am still evaluating whether they are suitable for building and training the model. The good thing is, there’s already an existing benchmark, so I can compare my model against it.

Yes, I am aware of Google Dataset Search; I have tried searching there but found nothing :slight_smile:


Do any of you have tips on how to further reduce my training loss and improve accuracy?

I’m currently creating a language model based on the sentiment140 Twitter dataset. I have already tried varying vocabulary sizes (50k, 25k), Adam with a low lr, SGD with momentum and a high lr, different embedding sizes, hidden layer sizes, batch sizes, and different loss multiplication ratios, but no matter what I try I can’t seem to get it below a value of 0.417, and even that took 30 epochs, while I see you guys easily getting below 0.4 in just 2 epochs.
My dataset has these properties:
training set length: 1,440,000
validation length: 160,000
unique words: 321,439
max vocab used: 50,000/25,000 (min freq. of 4 returns ~52k)
len(np.concatenate(trn_lm)): 22,498,795

chunksize: 50,000
em_sz,nh,nl: 400,1100,3 (Would smaller sizes be better for smaller datasets?)
bptt: 70
bs: 50
opt_fn: optim.SGD, momentum=0.9
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15], dtype="f")*0.5

Following these I decided to use an lr around 18, but the accuracy seems to fall with rising lr, so should I stick to something between 0 and 2.5 then?

The resulting training looks like this:

epoch      trn_loss   val_loss   accuracy                                                                              
    0      4.885424   4.765574   0.204343  
    1      4.68609    4.581247   0.219108                                                                              
    2      4.588282   4.500839   0.226138                                                                              
    3      4.54668    4.470822   0.227034                                                                              
    4      4.514667   4.445856   0.229765                                                                              
    5      4.476595   4.433705   0.23107                                                                               
    6      4.479217   4.425251   0.231592                                                                              
    7      4.452099   4.431449   0.230048                                                                              
    8      4.44206    4.419237   0.232063                                                                              
    9      4.436647   4.417188   0.232431                                                                              
    10     4.43317    4.412861   0.232667                                                                              
    11     4.422395   4.413309   0.232941                                                                              
    12     4.414105   4.402681   0.234613                                                                              
    13     4.425107   4.39716    0.234751                                                                              
    14     4.387628   4.395168   0.235595                                                                              
    15     4.402883   4.386707   0.235551                                                                              
    16     4.363533   4.378289   0.238221                                                                              
    17     4.357185   4.37697    0.237533                                                                              
    18     4.367101   4.368633   0.237971                                                                              
    19     4.313777   4.360797   0.240501                                                                              
    20     4.291882   4.358919   0.239816                                                                              
    21     4.281025   4.346954   0.242128                                                                              
    22     4.27367    4.337309   0.243213                                                                              
    23     4.240626   4.327436   0.244454                                                                              
    24     4.203354   4.322042   0.245484                                                                              
    25     4.24484    4.316995   0.245593                                                                              
    26     4.242165   4.313355   0.246129                                                                              
    27     4.175661   4.311628   0.246528                                                                              
    28     4.162489   4.308656   0.247344                                                                              
    29     4.17869    4.30674    0.247567

It seems to keep improving the longer I train, but I can’t let it train for too long, because I still need to use this computer for work, which I can’t do while it’s training… :confused:
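For a rough sense of how much work one epoch is with the settings listed in this post, the number of batches per epoch can be estimated from the token count, batch size and bptt. This is only an approximation: it assumes every batch holds exactly bs × bptt tokens, whereas the actual sequence length may vary from batch to batch.

```python
n_tokens = 22_498_795  # len(np.concatenate(trn_lm)) from the post
bs = 50                # batch size from the post
bptt = 70              # back-propagation-through-time length from the post

tokens_per_batch = bs * bptt
iters_per_epoch = n_tokens // tokens_per_batch
print(tokens_per_batch, iters_per_epoch)
```

Timing a few hundred iterations and multiplying by this count gives a quick estimate of epoch time before committing to a full run.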

Hi Christine,

Thanks for your post, it’s fascinating that your model is generating such coherent text!

I’m trying something similar but experiencing painfully slow training time using the AWD LSTM base model. I collected a huge set (5.5 billion tokens) of medical text, but quickly found that training on that much would literally take months. I culled the set down to 250 million tokens, but training is still 8.5 hr/epoch on an AWS p2 EC2 instance. I was curious how large your training corpus is.
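A back-of-the-envelope estimate for the full corpus, assuming training time scales linearly with the number of tokens (a simplification) and using the 8.5 hr/epoch figure measured on 250 million tokens. The 10-epoch run length is a hypothetical choice, not something stated above.

```python
def hours_per_epoch(n_tokens, ref_tokens=250e6, ref_hours=8.5):
    """Scale a measured epoch time linearly with corpus size (assumption)."""
    return ref_hours * n_tokens / ref_tokens

full_corpus_hours = hours_per_epoch(5.5e9)      # one epoch on 5.5 billion tokens
n_epochs = 10                                   # hypothetical run length
total_days = full_corpus_hours * n_epochs / 24

print(f"{full_corpus_hours:.0f} h/epoch, {total_days:.0f} days for {n_epochs} epochs")
```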

I read this post by Jeremy (Language Model Zoo 🦍) that said 100 million tokens is the most our LMs should need, but I feel like a highly technical corpus might necessitate a larger corpus.


Why did you pick that? That’s much higher than the charts suggest or we ever use in the course - you might want to re-watch that part of the lesson. Try an LR of 1.0 perhaps.

So I ran lr_find instead of lr_find2 this time. If I read the graph correctly and follow your advice from the lesson of taking the point with the highest learning rate where the loss is still strongly decreasing, then following this graph:
The choice would be around 10e1, right? But doesn’t that seem absurdly high?
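Note that 10e1 is 100. One way to take some guesswork out of reading the plot is to compute where the loss falls fastest per decade of learning rate. The sketch below is a generic heuristic, not a fastai function; the function name and the sample curve are made up for illustration.

```python
import math

def steepest_descent_lr(lrs, losses):
    """Return the LR at the start of the segment where the loss drops
    fastest per decade of learning rate."""
    best_i, best_slope = 0, 0.0
    for i in range(len(lrs) - 1):
        d_log_lr = math.log10(lrs[i + 1]) - math.log10(lrs[i])
        slope = (losses[i + 1] - losses[i]) / d_log_lr
        if slope < best_slope:  # more negative means a steeper drop
            best_i, best_slope = i, slope
    return lrs[best_i]

# Hypothetical LR-finder readings (learning rate, loss).
lrs = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
losses = [7.0, 6.9, 6.0, 4.5, 4.4, 9.0]
suggested = steepest_descent_lr(lrs, losses)
print(suggested)
```

A curve with two separate drops would produce two candidate regions, so it is still worth looking at the plot rather than trusting a single number.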

Just in case it’s useful to anyone here, I’ve uploaded here the notebook I used for the new pretrained model on wikitext-103 in fastai_v1. It’s not using the latest refactoring in fastai.text, but it can give you an idea of the hyper-parameters that were picked.
In particular, dropout can be really low since the corpus is so large (it can even be 0. on QRNNs).


Not sure why you’re seeing two separate sections in the graph, but I’d guess the first drop is best - i.e. 1e-2. You could try both and report back on your results.

1e-2 after one epoch:

epoch      trn_loss   val_loss   accuracy                                                                              
    0      6.783727   6.758687   0.045085  

10e1 after one epoch:

epoch      trn_loss   val_loss   accuracy                                                                              
    0      4.715281   4.639458   0.223043  

Sadly I can’t compare for a full run, because it’ll take me a whole day of not being able to work :confused:
I’m trying to run another test with Adam instead of SGD, but when I use use_wd_sched=True instead of use_clr_beta I run out of memory after just 20 iterations.
So Adam with clr_beta returns this after one epoch:

epoch      trn_loss   val_loss   accuracy                                                                              
    0      4.694598   4.621761   0.22373  

For Adam I got this plot:
And chose 5e-3, which is incidentally the same lr as in sgugger’s updated notebook. I guess learn.true_wd is the new version of use_wd_sched?

Well, I don’t think it would be a problem if you work at the word level rather than the Chinese-character level, because at the word level there would be no ambiguity. There are only a few cases of 1-to-N ambiguity (and N is very small, almost always 2), so a translator between traditional and simplified wouldn’t fail on those few 1-to-N cases.
But consider that many words are used differently in mainland China, Taiwan and Hong Kong; that is, they use different words to refer to the same thing. E.g. software is ‘软件’ in the mainland, but ‘软体’ in Taiwan. So I think the problem is not character mapping but word usage. With this in mind, it’s worth trying to train two models, or even three, based on region, because Hong Kong and Taiwan also differ from each other.

quite erudite!

Not sure if anyone is still working on ULMFiT for Japanese at the moment, but I wrote something when I was going through the course. I got it up to 90% accuracy. I assume it can be improved with some more tuning of the hyperparameters and data cleaning. Here’s the notebook.


That’s a good point. They’ll still be much more similar than different - so I think a combined model followed by regional-specific fine-tuning will likely work even better.


Hi everyone!
I have started to train an AWD LSTM model using v1 of fastai. While I was completely fascinated by the ease of use (it took, like, 5 lines of code to get started) and the flexibility of the framework, I have been running into technical problems. I mostly use default parameters, only tweaking Adam’s betas and the learning rate. My corpus is 110 million tokens, split 90/10 into train/validation. The first epoch mostly goes fine, though GPU memory utilization is around 99% from the start, but when I start another epoch, I get a CUDA OOM error. This prevents me from using cyclical learning rates. Sometimes I get the OOM at the end of the first epoch. Cutting down on bptt leads to slower convergence (and probably a worse outcome).
Has anyone had this problem and found a solution? My setup is a deep learning image on GCP with a K80 (12 GB).

I was just looking into doing this for Turkish. Glad to have found this thread.
