Hi @pandeyanil
I can help you with Hindi and Sanskrit.
Could you please guide me on how to start?
@shankarj67 If you haven't yet, check out http://course.fast.ai/lessons/lesson10.html, where Jeremy shows how to train and use the language models.
Once you are ready to start, there are also scripts that Jeremy and Sebastian created for the ablation studies. They are quite useful, as you can train your model with just command-line parameter changes, and they have pretty good documentation here:
https://github.com/fastai/fastai/blob/master/courses/dl2/imdb_scripts/README.md
@t-v, @MatthiasBachfischer, @elyase, @rother, @aayushy
GermEval 2018 has some pretty well-suited tasks for ULMFiT: classification and fine-grained classification. In case you aren't taking part in the competition already, we can train ULMFiT with SentencePiece on the competition data and compare the results on September 21 (the workshop day).
If you took part in the competition and won, can you share your paper or provide an appropriate citation?
We won the 3rd task in PolEval 2018 using ULMFiT with SentencePiece for tokenization. Unfortunately, the task was just about creating a language model, so we couldn't use transfer learning. I'm looking for an example where SentencePiece + ULMFiT achieves SOTA on downstream tasks, to justify our claims in the paper.
If you take part in any competition with your LMs, one thing that helped us the most was to try many different parameters on a very small corpus (10M tokens); thanks to this we could check 53 combinations in just under a day.
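For illustration, that kind of sweep can be a simple loop over parameter combinations on the small corpus; a sketch, where train_and_eval is a hypothetical wrapper around your own training script (the parameter grids here are made up):

import itertools

lrs = [1.0, 5.0, 10.0]
drop_scales = [0.3, 0.5, 0.7]
vocab_sizes = [25000, 50000]

results = []
for lr, scale, vocab in itertools.product(lrs, drop_scales, vocab_sizes):
    # train on the ~10M-token corpus and return the validation loss
    val_loss = train_and_eval(corpus='small_10M', lr=lr,
                              drop_scale=scale, vocab_size=vocab)
    results.append((val_loss, lr, scale, vocab))

print(sorted(results)[0])  # best combination; re-run it on the full corpus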
I've been training a LM on clinical/medical text using the MIMIC-III database and things have been going really well. The initial model I completed today (~13 hours of training time) had a perplexity of ~15 on the validation set, with an average accuracy of 60% in predicting the next word.
The initial model is a word-level model that uses the tokenization methods from the course; this will be the baseline that I'll use to compare different tokenization methods/hyperparameters against.
The initial results seem too good to be true to me, so I'll be digging into it a bit more to see if there's some area where I'm allowing for information leakage, or if it's gotten really good at predicting nonsense (for example, there's a lot of upper case in my corpus, so I wonder if it's gotten really good at predicting the uppercase token). I'll need to do some more research as well to see if there are published papers that I can compare results against.
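One cheap sanity check for the "predicting nonsense" worry is to recompute next-word accuracy with the suspect tokens masked out; a rough sketch (the token name 't_up' and the tensor shapes are assumptions):

import torch

# preds: (n_tokens, vocab_size) logits over the validation stream
# targets: (n_tokens,) true next-word ids; stoi maps token -> id
ignore_ids = torch.tensor([stoi['t_up']])  # e.g. the uppercase marker token
mask = ~torch.isin(targets, ignore_ids)

acc_all = (preds.argmax(dim=1) == targets).float().mean()
acc_masked = (preds.argmax(dim=1)[mask] == targets[mask]).float().mean()
print(acc_all.item(), acc_masked.item())  # a large gap means special tokens inflate accuracy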
All in all it's pretty amazing how quickly I've been able to set this up and get things running; thanks to everyone in this thread for sharing their work and thoughts. I'm writing up a blog post about what I'm currently doing and will share it soon as well.
We have an entry for GermEval (binary task only), but I'm fairly confident that it is not that great. Unfortunately I saw the competition late and had a very heavy workload towards the end that clashed a bit with doing more. Additionally, there were some technical difficulties towards the end (heatwave in Germany + computers that crunch for 3-4 days = bad combination). We deliberately kept it very vanilla ULMFiT, so I just used a 50k-token German Wikipedia LM, about 300k self-collected unlabeled tweets, and just the provided training data. No ensembling. The LM and the Twitter model are pretty decent, I think (<28 and <18 perplexity respectively). The classifier eventually converged (I underestimated this step) and we got an F1 of about 0.8 on the validation set, which I'd have been very happy with, but a rather disappointing score on the test set. I'll discuss the final results after the event (it's this weekend). If anyone else from these forums attends, shoot me a PM and let's meet/talk.
Even with the very hectic finish, I'd do it again. Very many lessons learned. I'm confident that the results can be improved a good bit, and I have some ideas but little time.
Jeremy gave us an excellent opportunity to deliver very tangible results and learn along the way. But it is up to us to get ourselves together and produce working models.
I know that ULMFiT is a beast (sometimes): you need tons of memory, and it takes a full day of warming your room just to see that the language model isn't as good as you wanted. I get it, but that is how deep learning usually feels; if it were easy, there wouldn't be any fun in doing this.
But we are so close. Let's get it done!
I mean a chat where there are people who work on the same language model. People who care that your model got a perplexity of 60, and who understand whether that is good or bad. And who can offer you an emoji or an animated GIF.
A support group == a thread for each language.
If you are in, vote below to join a language group and start training.
The first person that votes should create a thread and link it to the first post above (you have 3 votes):
Kristian, that is a lot of work. German is a very well supported language, so the competition is strong. Getting the additional Twitter data was a smart move.
If you want to team up and still try to beat the SOTA, shall we work together?
I have a working SentencePiece implementation; it could address the very long words that German sometimes has, and you have the additional data, so maybe this will help?
If you are in, vote in the little poll above and I will create a separate thread where we can share results and work together. Then we can divide the work and start experimenting. It is after the competition, so there is no need for secrecy and the work can be public.
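For anyone curious, training a SentencePiece model takes only a few lines with Google's sentencepiece package; a minimal sketch (the file names and vocab size are placeholders, not our actual setup):

import sentencepiece as spm

# train a subword model on a raw-text corpus; subword units handle
# long German compound words gracefully
spm.SentencePieceTrainer.Train(
    '--input=corpus_de.txt --model_prefix=sp_de --vocab_size=25000 --model_type=unigram')

sp = spm.SentencePieceProcessor()
sp.Load('sp_de.model')
print(sp.EncodeAsPieces('Donaudampfschifffahrtsgesellschaft'))  # splits into subword pieces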
I would suggest ensembling the SentencePiece and word-level models (which, along with the forwards and backwards models, means you'll be ensembling 4 models in total).
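A minimal sketch of that ensemble, assuming you already have per-class probabilities from each of the four classifiers on the same test examples (the array names are placeholders):

import numpy as np

# {word-level, SentencePiece} x {forward, backward} = 4 models;
# each p_* is an (n_examples, n_classes) array of softmax outputs
probs = [p_word_fwd, p_word_bwd, p_sp_fwd, p_sp_bwd]
avg = np.mean(probs, axis=0)  # simple unweighted average
preds = avg.argmax(axis=1)    # final ensembled labels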
Hi @hafidz. Thanks for checking. I should have provided my updates earlier. Here's the current progress:
So, everything is done for language modelling. It took me a while, as I was not satisfied with the model performance (perplexity) during the first few iterations. Currently, I am hitting a roadblock on the text classification tasks. Anyway, with that aside, I think the Malay language model is ready to be contributed to the model zoo, so I will announce this shortly.
Hey folks,
I hope your day is going well. I am happy to contribute a Malay language model to the model zoo.
ULMFiT in Malay language
The final validation loss was 3.38 (29.30 perplexity) and the accuracy was around 41% on Malay Wikipedia corpus.
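(For reference, perplexity is just the exponential of the cross-entropy validation loss, so the two reported numbers are consistent:)

import math
print(math.exp(3.377671))  # ~29.30, the perplexity implied by the final val_loss below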
Hyper-parameters:
em_sz = 400  # size of each embedding vector
nh = 1150  # number of hidden activations per layer
nl = 3  # number of layers
wd = 1e-7  # weight decay
bptt = 70  # backprop-through-time sequence length
bs = 64  # batch size
opt_fn = partial(optim.SGD, momentum=0.9)  # SGD with momentum as the optimizer
weight_factor = 0.3  # global scaling applied to the dropouts below
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15]) * weight_factor  # the five AWD-LSTM dropouts
learner.clip = 0.2  # gradient clipping
lr = 8  # learning rate
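For context, these settings plug into the fastai v0.7 course setup roughly like this; a sketch, assuming a LanguageModelData object md has been built beforehand (as in the lesson 10 notebook):

# the five entries of drops feed the five dropouts of the AWD-LSTM
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=drops[0], dropout=drops[1], wdrop=drops[2],
                       dropoute=drops[3], dropouth=drops[4])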
Training:
learner.fit(lr, 1, wds=wd, cycle_len=10, use_clr=(10,33,0.95,0.85), best_save_name='best_lm_malay_1cycle')
# Training loss history
epoch trn_loss val_loss accuracy
0 4.114716 3.936571 0.367859
1 3.83864 3.711561 0.382893
2 3.669321 3.603781 0.391633
3 3.63252 3.560706 0.394518
4 3.478959 3.513905 0.399009
5 3.518267 3.480469 0.401523
6 3.409158 3.465206 0.402808
7 3.426483 3.437133 0.405097
8 3.296175 3.409095 0.409595
9 3.185208 3.377671 0.413643
I tried to speed up training using Leslie Smith's work on the 1cycle policy, in which he describes the super-convergence phenomenon. The model was trained using an implementation of this method in the fastai library, cyclical learning rates (CLR). Interestingly, based on my own experiments and observations with this method, the AWD-LSTM model converged faster: instead of 15 epochs, it took just 10.
It took me around 1 hour 24 minutes to train 1 epoch on one Tesla K80 GPU. The full training took me around 14 hours.
I think there's room for further improvements. Next up, I plan to build a Singlish language model.
@cedric, nice to see progress on the Malay language. Can you say how many words you have in your vocabulary?
We should check this model on downstream tasks, like text classification. The issue with language modelling is that you can get superb perplexity if you have a small / too small vocabulary, and this perplexity doesn't necessarily translate to good performance on downstream tasks.
If you can find any competition for Malay, try what we are doing for Polish:
Find any text corpus that you can classify, e.g.
Since such a data set would be new, you won't know the SOTA, but applying text classification both without pretraining and with pretraining will give you a good baseline; it will show how pretraining helps.
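In fastai v0.7 terms the comparison looks roughly like this; a sketch, assuming a text classifier learner learn and an encoder saved from the pretrained LM (the names are placeholders):

# run 1: fine-tune from the pretrained LM encoder
learn.load_encoder('lm_enc')  # weights saved earlier from the language model
learn.fit(lr, 1, cycle_len=3)

# run 2: same model, but skip load_encoder and train from random weights;
# the gap in validation accuracy / F1 between the runs shows what pretraining buys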
Besides, Google recently released a dataset search; you may try to find Malay there:
https://toolbox.google.com/datasetsearch
@cedric, @hafidz what would you say to creating a separate thread for Malay and discussing this there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German
Great idea. Would love to contribute if there are more tasks to be done. I've created a page for Malay for further discussion for anyone who's interested: ULMFiT - Malay
Hey, thank you for your reply and your efforts on organizing this thread. Nice job.
60,000.
Thanks for the tips. I agree. I will work on the downstream tasks soon.
OK, I will take a look.
I found one or two small text corpora hidden in some academic papers published by Malaysia's local universities. I am still evaluating whether the corpus is suitable for building and training the model. The good thing is, there's already an existing benchmark, so I can compare my model against it.
Yes, I am aware of Google Dataset Search; I have tried searching there and found nothing.
Does any of you have tips on how to further reduce my training loss and improve accuracy?
I'm currently creating a language model based on the sentiment140 Twitter dataset. I have already tried varying vocabulary sizes (50k, 25k), Adam with a low lr, SGD with momentum and a high lr, different embedding sizes, hidden layer sizes, batch sizes, and different loss multiplication ratios, but no matter what I try I can't seem to get it below a value of 0.417, and that took 30 epochs, while I see you guys easily getting below 0.4 in just 2 epochs.
My dataset has these properties:
training set length: 1,440,000
validation length: 160,000
unique words: 321,439
max vocab used: 50,000/25,000 (min freq. of 4 returns ~52k)
len(np.concatenate(trn_lm)): 22,498,795
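The vocab counts above can be reproduced with a quick frequency count; a sketch, assuming tokenized_texts is the list of tokenized tweets:

from collections import Counter

freq = Counter(tok for text in tokenized_texts for tok in text)
vocab = [tok for tok, c in freq.most_common() if c >= 4]  # min frequency of 4
print(len(vocab))  # ~52k on this corpus, per the stats above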
Settings:
chunksize: 50,000
em_sz,nh,nl: 400,1100,3 (Would smaller sizes be better for smaller datasets?)
bptt: 70
bs: 50
opt_fn: optim.SGD, momentum=0.9
drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15], dtype='f') * 0.5
use_clr_beta=(10,20,0.95,0.85)
Following these, I decided to use an lr around 18, but the accuracy seems to fall with rising lr, so should I stick to something between 0 and 2.5 instead?
The resulting training looks like this:
epoch trn_loss val_loss accuracy
0 4.885424 4.765574 0.204343
1 4.68609 4.581247 0.219108
2 4.588282 4.500839 0.226138
3 4.54668 4.470822 0.227034
4 4.514667 4.445856 0.229765
5 4.476595 4.433705 0.23107
6 4.479217 4.425251 0.231592
7 4.452099 4.431449 0.230048
8 4.44206 4.419237 0.232063
9 4.436647 4.417188 0.232431
10 4.43317 4.412861 0.232667
11 4.422395 4.413309 0.232941
12 4.414105 4.402681 0.234613
13 4.425107 4.39716 0.234751
14 4.387628 4.395168 0.235595
15 4.402883 4.386707 0.235551
16 4.363533 4.378289 0.238221
17 4.357185 4.37697 0.237533
18 4.367101 4.368633 0.237971
19 4.313777 4.360797 0.240501
20 4.291882 4.358919 0.239816
21 4.281025 4.346954 0.242128
22 4.27367 4.337309 0.243213
23 4.240626 4.327436 0.244454
24 4.203354 4.322042 0.245484
25 4.24484 4.316995 0.245593
26 4.242165 4.313355 0.246129
27 4.175661 4.311628 0.246528
28 4.162489 4.308656 0.247344
29 4.17869 4.30674 0.247567
It seems to keep improving the longer I train, but I can't let it run for too long, because I still need to use this computer for work, which I can't do while it's training.
Hi Christine,
Thanks for your post, it's fascinating that your model is generating such coherent text!
I'm trying something similar but experiencing painfully slow training using the AWD LSTM base model. I collected a huge set (5.5 billion tokens) of medical text, but quickly found that training on that much would literally take months. I culled the set down to 250 million tokens, but training still takes 8.5 hr/epoch on an AWS p2 EC2 instance. I was curious how large your training corpus is.
I read this post by Jeremy (Language Model Zoo 🦍) that said 100 million tokens is the most our LMs should need, but I feel like a highly technical corpus might necessitate a larger one.
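If it helps, culling to a fixed token budget can be done by sampling whole documents until the budget is reached; a sketch (docs and the budget are placeholders):

import random

random.seed(42)
random.shuffle(docs)  # docs: list of tokenized documents

budget, total, sample = 100_000_000, 0, []
for doc in docs:
    if total >= budget:
        break
    sample.append(doc)
    total += len(doc)  # keep whole documents until ~100M tokens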
Thanks!
-Bill
Why did you pick that? That's much higher than the charts suggest or we ever use in the course - you might want to re-watch that part of the lesson. Try an LR of 1.0 perhaps.
So I ran lr_find instead of lr_find2 this time, and if I read the graph correctly and follow your advice from the lesson (taking the point with the highest learning rate where the loss is still strongly decreasing), then following this graph:
The choice would be around 10e1, right? But doesn't that seem absurdly high?
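For reference, the learning-rate finder flow in fastai v0.7 is just the following; a sketch, with the interpretation per Jeremy's reply above:

learner.lr_find()
learner.sched.plot()  # plots loss vs learning rate on a log scale
# pick a rate where the loss is still clearly falling, well before the minimum;
# per the reply above, that is around 1.0 here rather than 10e1 (= 100)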