Language Model Zoo šŸ¦

Hi @pandeyanil

I can help you with Hindi and Sanskrit.

Could you please guide me on how to start?

@shankarj67 if you havenā€™t yet, check out http://course.fast.ai/lessons/lesson10.html, where Jeremy shows how to train and use the language models.
Once you are ready to start, there are also scripts that Jeremy and Sebastian created for the ablation studies. They are quite useful, since you can train your model with just command-line parameter changes, and they have pretty good documentation here:
https://github.com/fastai/fastai/blob/master/courses/dl2/imdb_scripts/README.md

@t-v, @MatthiasBachfischer, @elyase, @rother, @aayushy

GermEval 2018 has some pretty well-suited tasks for ULMFiT: classification and fine-grained classification. In case you arenā€™t taking part in the competition already, we can train ULMFiT with SentencePiece on the competition data, and we will be able to compare the results on September 21 (the workshop day).

If you took part in the competition and won, can you share your paper or provide an appropriate citation?
We won task 3 of PolEval 2018 using ULMFiT with SentencePiece for tokenization. Unfortunately, the task was just about creating a language model, so we couldnā€™t use the transfer learning part. Iā€™m looking for an example where SentencePiece + ULMFiT achieves SOTA on downstream tasks to justify our claims in the paper.
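In case it helps anyone trying the same setup, training the SentencePiece model itself is only a few lines with the sentencepiece Python package. This is just a sketch; the file names and vocab size are placeholders, not what we used:

import sentencepiece as spm

# train a SentencePiece model on a raw text file (one sentence per line);
# corpus.txt, the prefix and the vocab size are placeholders, adjust to your data
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=sp_model --vocab_size=30000 --model_type=unigram'
)

sp = spm.SentencePieceProcessor()
sp.Load('sp_model.model')

# sub-word pieces go into the LM instead of whole words
pieces = sp.EncodeAsPieces('To jest przykładowe zdanie.')  # "This is an example sentence."
ids = sp.EncodeAsIds('To jest przykładowe zdanie.')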


If you take part in any competition with your LMs, one thing that helped us the most was to try many different parameters on a very small corpus (about 10M tokens); thanks to this we could check 53 combinations in just under a day.
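To make that concrete, the sweep itself can be a plain loop over parameter combinations, each trained on the small corpus only. This is just a sketch; train_small_lm is a hypothetical stand-in for your own training routine:

import random
from itertools import product

def train_small_lm(**hparams):
    # placeholder: swap in your real "train on the ~10M-token subset and
    # return validation loss" routine here
    return random.random()

# hypothetical grid; use whatever parameters you actually want to compare
drop_mults = [0.3, 0.5, 0.7]
lrs = [1.0, 4.0, 8.0]
bptts = [50, 70]
emb_sizes = [300, 400]

results = []
for drop_mult, lr, bptt, em_sz in product(drop_mults, lrs, bptts, emb_sizes):
    val_loss = train_small_lm(drop_mult=drop_mult, lr=lr, bptt=bptt, em_sz=em_sz)
    results.append((val_loss, drop_mult, lr, bptt, em_sz))

# the best settings on the small corpus are then retrained on the full corpus
print(sorted(results)[:5])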


Iā€™ve been training an LM on clinical/medical text using the MIMIC-III database, and things have been going really well. The initial model I completed today (~13 hours of training time) had a perplexity of ~15 on the validation set, with an average accuracy of 60% in predicting the next word.
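For anyone comparing numbers: the perplexity I quote is just the exponential of the average per-token cross-entropy (validation) loss that the training loop reports, roughly:

import math

val_loss = 2.71                  # approximate validation cross-entropy behind the ~15 perplexity
perplexity = math.exp(val_loss)  # exp(2.71) ā‰ˆ 15.0
print(perplexity)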

The initial model is a word-level model that uses the tokenization methods from the course; this will be the baseline Iā€™ll compare different tokenization methods/hyper-parameters against.

The initial results seem too good to be true to me, so Iā€™ll be digging into it a bit more to see if thereā€™s some area where Iā€™m allowing information leakage, or if it has just gotten really good at predicting nonsense (for example, thereā€™s a lot of upper case in my corpus, so I wonder if it has gotten really good at predicting the uppercase token). Iā€™ll also need to do some more research to see if there are published papers I can compare results against.
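One cheap sanity check Iā€™m planning is to look at how much of the corpus the most frequent tokens account for, including the uppercase marker the course tokenizer inserts (t_up, if I remember right), since always predicting those already sets a floor on next-word accuracy. A rough sketch, with a placeholder token list standing in for my tokenized corpus:

from collections import Counter

# placeholder tokens; in practice this is the flat list of training tokens
# produced by the course-style tokenizer (which inserts markers such as 't_up')
trn_tokens = ['t_up', 'patient', 'was', 'admitted', 't_up', 'bp', 'stable', '.']

counts = Counter(trn_tokens)
total = sum(counts.values())
for tok, n in counts.most_common(10):
    print(f'{tok:12s} {n / total:.2%}')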

All in all, itā€™s pretty amazing how quickly Iā€™ve been able to set this up and get things running; thanks to everyone in this thread for sharing their work and thoughts. Iā€™m writing up a blog post about what Iā€™m currently doing and will share it soon as well.


We have an entry for GermEval (binary task only), but Iā€™m fairly confident that it is not that great. Unfortunately, I saw the competition late and had a very heavy workload towards the end that clashed a bit with doing more. Additionally, there were some technical difficulties towards the end (heatwave in Germany + computers that crunch for 3-4 days = bad combination). We deliberately kept it very vanilla ULMFiT, so I just used a 50k-token German Wiki LM, about 300k self-collected unlabeled tweets, and just the provided training data. No ensembling. The LM and the Twitter model are pretty decent, I think (<28 perplexity and <18 perplexity respectively). The classifier eventually converged (I underestimated this step) and we got an F1 of about 0.8 on the validation set, which Iā€™d have been very happy with, but a rather disappointing score on the test set. Iā€™ll discuss the final results after the event (itā€™s this weekend). If anyone else from these forums attends, shoot me a PM and letā€™s meet/talk :slight_smile:

Even with the very hectic finish, Iā€™d do it again. Many lessons learned. Iā€™m confident that the results can be improved a good bit, and I have some ideas but little time :slight_smile:


Letā€™s clean up and get ULMFiT working on our languages

Jeremy gave us an excellent opportunity to deliver very tangible results and learn along the way. But it is up to us to get ourselves together and produce working models.

I know that ULMFiT is a beast (sometimes): you need tons of memory, and it takes a full day of warming up your room just to see that the language model isnā€™t as good as you wanted. I get it, but that is how deep learning usually feels :slight_smile: if it were easy, there wouldnā€™t be any fun in doing this.

But we are so close. Letā€™s get it done!

How about multiple self-support groups?

I mean a chat with the people who work on the same language model. People who care that your model got a perplexity of 60, who understand whether that is good or bad, and who can offer you an emoji or an animated gif.

A support group == a thread for each language.

If you are in, vote below to join a language group and start training.
The first person who votes should create a thread and link it in the first post above (you have 3 votes):

  • Bengali
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Danish
  • Esperanto
  • Estonian
  • Finnish
  • French
  • German
  • Hebrew
  • Hindi
  • Italian
  • Indonesian
  • Japanese
  • Korean


  • Malay
  • Malayalam
  • ** Medical
  • ** Music (generating music in the style of Mozart & Brahms)
  • Norwegian
  • Polish
  • Portuguese
  • Russian
  • Sanskrit
  • Spanish
  • Swahili
  • Swedish
  • Tamil
  • Telugu
  • Thai
  • ** isiXhosa


Kristian, that is a lot of work. German is a very well-supported language, so the competition is strong. Getting the additional Twitter data was a smart move.

If you want to team up and still try to beat the SOTA, we could work together.
I have a working SentencePiece implementation; it could address the very long words that German sometimes has, and you have the additional data, so maybe this will help?
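To illustrate the long-word point, a SentencePiece model trained on German splits rare compounds into sub-word pieces instead of throwing them into the unknown-word bucket. A sketch, where de_sp.model is a placeholder for whatever model we would train:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('de_sp.model')  # placeholder: a SentencePiece model trained on German text

# a long compound that a 50k word-level vocabulary would most likely map to <unk>
print(sp.EncodeAsPieces('Donaudampfschifffahrtsgesellschaft'))
# e.g. ['ā–Donau', 'dampf', 'schiff', 'fahrt', 's', 'gesellschaft'] (exact split depends on the training data)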

If you are in, vote in the little poll above and I will create a separate thread to share the results and to work together. Then we can divide the work and start experimenting. It is after the competition, so there is no need for secrecy and the work can be public.


Hi @cedric. May I know how far you have gotten with the Malay model?

I would suggest ensembling the SentencePiece and word-level models (which, along with forwards and backwards models, means youā€™ll be ensembling 4 models in total).
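A minimal sketch of one way to do that ensembling: average the softmaxed class probabilities from the four classifiers over the same test examples (the .npy file names are placeholders for however you saved each modelā€™s predictions):

import numpy as np

# each array is assumed to be (n_examples, n_classes) softmaxed probabilities
# from one classifier, all aligned on the same test set
preds_word_fwd = np.load('preds_word_fwd.npy')
preds_word_bwd = np.load('preds_word_bwd.npy')
preds_sp_fwd = np.load('preds_sp_fwd.npy')
preds_sp_bwd = np.load('preds_sp_bwd.npy')

ensemble = (preds_word_fwd + preds_word_bwd + preds_sp_fwd + preds_sp_bwd) / 4
pred_classes = ensemble.argmax(axis=1)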


Hi @hafidz. Thanks for checking. I should have provided my updates earlier. Hereā€™s the current progress:

  • [DONE] Download and extract Malay Wikipedia corpus
  • [DONE] Process text (clean and tokenize text)
  • [DONE] Create validation set
  • [DONE] Create data loader for training
  • [DONE] Numericalize the text
  • [DONE] Model setup
  • [DONE] Train model
  • [DONE] Evaluate language model
  • [NOT DONE] Fine-tune language model for text classification task
  • [NOT DONE] Build model for text classification
  • [IN-PROGRESS] Find curated or publicly available labelled dataset for Malay corpus
  • [NOT DONE] Create my own dataset by curating and labelling Malay text scraped from news sites
  • [NOT DONE] Benchmark model for text classification

So, everything is done for language modelling. It took me a while as I was not satisfied with the modelā€™s performance (perplexity) during the first few iterations. Currently, I am hitting a roadblock with the text classification tasks. Anyway, with that aside, I think the Malay language model is ready to be contributed to the model zoo, so I will announce it shortly.

Hey folks,

I hope your day is going well. I am happy to contribute a Malay language model to the model zoo.

ULMFiT in Malay language

The final validation loss was 3.38 (29.30 perplexity) and the accuracy was around 41% on the Malay Wikipedia corpus.

Hyper-parameters:

# imports needed for the snippet below
import numpy as np
from functools import partial
import torch.optim as optim

em_sz = 400  # size of each embedding vector
nh = 1150    # number of hidden activations per layer
nl = 3       # number of layers
wd = 1e-7    # weight decay
bptt = 70    # backprop-through-time sequence length
bs = 64      # batch size
opt_fn = partial(optim.SGD, momentum=0.9)
weight_factor = 0.3

# dropout multipliers for the AWD-LSTM layers, scaled down by weight_factor
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * weight_factor

learner.clip = 0.2  # gradient clipping

lr = 8  # learning rate
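For context, this is roughly how those values get plugged into the language model with the fastai 0.7 course API. This is a sketch from memory of the lesson 10 style code; md is assumed to be the LanguageModelData object built from the numericalized Wikipedia corpus:

from fastai.text import *  # fastai 0.7, as used in the course notebooks

# md is assumed to have been built beforehand, e.g.
# md = LanguageModelData(PATH, 1, vocab_size, trn_dl, val_dl, bs=bs, bptt=bptt)
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])
learner.metrics = [accuracy]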

Training:

learner.fit(lr, 1, wds=wd, cycle_len=10, use_clr=(10,33,0.95,0.85), best_save_name='best_lm_malay_1cycle')

# Training loss history
epoch      trn_loss   val_loss   accuracy                       
    0      4.114716   3.936571   0.367859  
    1      3.83864    3.711561   0.382893                       
    2      3.669321   3.603781   0.391633                       
    3      3.63252    3.560706   0.394518                       
    4      3.478959   3.513905   0.399009                       
    5      3.518267   3.480469   0.401523                       
    6      3.409158   3.465206   0.402808                       
    7      3.426483   3.437133   0.405097                       
    8      3.296175   3.409095   0.409595                       
    9      3.185208   3.377671   0.413643

I tried to speed up training using Leslie Smithā€™s work on the 1cycle policy, in which he describes the super-convergence phenomenon. The model was trained using an implementation of this method in the fastai library (cyclical learning rates, CLR). Interestingly, based on my own experiments and observations with this method, the AWD-LSTM model converged faster: instead of 15 epochs, it took just 10 epochs.

It took me around 1 hour 24 minutes to train 1 epoch on one Tesla K80 GPU. The full training took me around 14 hours.

I think thereā€™s room for further improvements. Next up, I plan to build a Singlish language model. :grin:


Great job @cedric.


@cedric, nice to see progress on the Malay language. Can you tell us how many words you have in your vocabulary?

We should check this model on downstream tasks, like text classification. The issue with language modelling is that you can have superb perplexity if you have a small / too small vocabulary, and this perplexity doesnā€™t necessarily translate to good performance on downstream tasks.

If you can find any competition for Malay, try what we are doing for Polish:

  • Find any text corpus that you can classify, e.g.:

    • Newspaper articles: business, politics, sport, fashion, etc.
    • Sentiment of user comments (we are working with the Polish version of Goodreads to obtain comments)
    • Worst case, just classify whether something is from Wikipedia or from a newspaper
  • Since such a dataset would be new, you wonā€™t know the SOTA, but applying text classification without pretraining and with pretraining will give you a good baseline; it will show how pretraining helps.

Besides, Google recently released Dataset Search; you may try to find Malay data there:
https://toolbox.google.com/datasetsearch

@cedric, @hafidz, what would you say to creating a separate thread for Malay and discussing it there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German


Great idea. Would love to contribute if there are more tasks to be done. Iā€™ve created a page for Malay for further discussion for anyone whoā€™s interested: ULMFiT - Malay


Hey, thank you for your reply and your efforts on organizing this thread. Nice job.

60,000.

Thanks for the tips. I agree. I will work on the downstream tasks soon.

OK, I will take a look.

I found one or two small text corpora hidden in some academic papers published by Malaysiaā€™s local universities. I am still evaluating whether they are suitable for building and training the model. The good thing is, thereā€™s already an existing benchmark, so I can compare my model against it.

Yes, I am aware of Google Dataset Search and have tried searching there, but found nothing :slight_smile:


Do any of you have tips on how to further reduce my training loss and improve my accuracy?

Iā€™m currently creating a language model based on the sentiment140 Twitter dataset. I have already tried varying vocabulary sizes (50k, 25k), Adam with a low lr, SGD with momentum and a high lr, different embedding sizes, hidden layer sizes, batch sizes, and different loss multiplication ratios, but no matter what I try I canā€™t seem to get it below a value of 0.417, and even that took 30 epochs, while I see you guys easily getting below 0.4 in just 2 epochs.
My dataset has these properties:
trainingset length: 1,440,000
validation length: 160,000
unique words: 321,439
max vocab used: 50,000/25,000 (min freq. of 4 returns ~52k)
len(np.concatenate(trn_lm)): 22,498,795

settings:
chunksize: 50,000
em_sz,nh,nl: 400,1100,3 (Would smaller sizes be better for smaller datasets?)
bptt: 70
bs: 50
opt_fn: optim.SGD, momentum=0.9
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15], dtype="f") * 0.5
use_clr_beta=(10,20,0.95,0.85)

Following these, I decided to use an lr of around 18, but the accuracy seems to fall with a rising lr, so should I stick to something between 0 and 2.5 then?

the resulting training looks like this:

epoch      trn_loss   val_loss   accuracy                                                                              
    0      4.885424   4.765574   0.204343  
    1      4.68609    4.581247   0.219108                                                                              
    2      4.588282   4.500839   0.226138                                                                              
    3      4.54668    4.470822   0.227034                                                                              
    4      4.514667   4.445856   0.229765                                                                              
    5      4.476595   4.433705   0.23107                                                                               
    6      4.479217   4.425251   0.231592                                                                              
    7      4.452099   4.431449   0.230048                                                                              
    8      4.44206    4.419237   0.232063                                                                              
    9      4.436647   4.417188   0.232431                                                                              
    10     4.43317    4.412861   0.232667                                                                              
    11     4.422395   4.413309   0.232941                                                                              
    12     4.414105   4.402681   0.234613                                                                              
    13     4.425107   4.39716    0.234751                                                                              
    14     4.387628   4.395168   0.235595                                                                              
    15     4.402883   4.386707   0.235551                                                                              
    16     4.363533   4.378289   0.238221                                                                              
    17     4.357185   4.37697    0.237533                                                                              
    18     4.367101   4.368633   0.237971                                                                              
    19     4.313777   4.360797   0.240501                                                                              
    20     4.291882   4.358919   0.239816                                                                              
    21     4.281025   4.346954   0.242128                                                                              
    22     4.27367    4.337309   0.243213                                                                              
    23     4.240626   4.327436   0.244454                                                                              
    24     4.203354   4.322042   0.245484                                                                              
    25     4.24484    4.316995   0.245593                                                                              
    26     4.242165   4.313355   0.246129                                                                              
    27     4.175661   4.311628   0.246528                                                                              
    28     4.162489   4.308656   0.247344                                                                              
    29     4.17869    4.30674    0.247567

It seems to keep improving the longer I train, but I canā€™t let it train for too long, because I still need to use this computer for work, which I canā€™t do while itā€™s trainingā€¦ :confused:

Hi Christine,

Thanks for your post, itā€™s fascinating that your model is generating such coherent text!

Iā€™m trying something similar but experiencing painfully slow training times using the AWD-LSTM base model. I collected a huge set (5.5 billion tokens) of medical text, but quickly found that training on that much would literally take months. I culled the set down to 250 million tokens, but training still takes 8.5 hr/epoch on an AWS p2 EC2 instance. I was curious how large your training corpus is.

I read this post by Jeremy (Language Model Zoo šŸ¦) that said 100 million tokens is the most our LMs should need, but I feel like a highly technical domain might necessitate a larger corpus.

Thanks!
-Bill

Why did you pick that? Thatā€™s much higher than the charts suggest or than we ever use in the course; you might want to re-watch that part of the lesson. Try an LR of 1.0 perhaps.

So I ran lr_find instead of lr_find2 this time, and if I read the graph correctly and follow your advice from the lesson (taking the point with the highest learning rate where the loss is still strongly decreasing), then following this graph:

[lr_find plot: loss vs. learning rate]

the choice would be around 10e1, right? But doesnā€™t that seem absurdly high?
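For reference, this is what Iā€™m running to get that graph, and how I understand the rule of thumb (fastai 0.7 API as in the course; just a sketch, with learner being the RNN learner already set up above):

learner.lr_find()     # sweep increasing learning rates over a few hundred mini-batches
learner.sched.plot()  # loss vs. learning rate, x-axis on a log scale

# rule of thumb from the lesson: pick a rate where the loss is still clearly falling,
# roughly an order of magnitude below the point where it starts to blow up
learner.fit(1.0, 1, cycle_len=1, use_clr_beta=(10, 20, 0.95, 0.85))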