Language Model Zoo 🦍


(Sam) #267

Hi! I’m sorry if this is a n00b mistake. I’m using @lesscomfortable’s Spanish LM, for which he graciously provides a GDrive link on the linked GitHub repo. However, in general, is something like the fwd_wt103.h5 model useful without the corresponding itos_wt103.pkl?

That is, without mapping the classification task’s vocab to the LM’s vocab, would we get any benefit?


(Francisco Ingham) #268

I think you are right, Sam. I’ll upload the itos file tomorrow so people can use it with the model. I’ll also answer your other GitHub question tomorrow.


(Sam) #270

Sorry, I know this is very late, but I had this same problem… it seems that, regardless of which language you are working on, fastai requires spaCy models for both that language and English. So just !python3 -m spacy download en … and that will fix it. :no_mouth:
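For completeness, a minimal setup along these lines should cover both models (Spanish here is just an illustrative example; substitute your own working language):

```shell
# English model: fastai's tokenization expects this regardless of target language
python3 -m spacy download en

# Plus the model for the language you are actually working in, e.g. Spanish
python3 -m spacy download es
```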


(Sam) #271

@lesscomfortable - once I had the itos file, everything worked great. Accuracy is almost as high as the English classifier. Thanks!


(Francisco Ingham) #272

That’s good to hear! If you are going to make any modifications to improve performance, please let me know and we can include them in the repo.


(Fabian) #273

With the German language model by @t-v, I tried to classify emails as part of a project. After fitting the last layer and running lr_find with the default fast.ai code, I get the following plot:

I only have ~1,300 emails as a training set. Might this be the reason for the unusual-looking lr_find plot, or did I mess something up along the way?

Best,
Fabian


(Sam) #274

I have a training set with around 5k examples, and my optimal learning rate is usually around 10^-1
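For anyone puzzling over these plots: lr_find works by sweeping the learning rate exponentially from a tiny value up to a large one over a single pass while recording the loss, so with a very small dataset there are few batches and the resulting curve can look jagged. A minimal sketch of that exponential sweep schedule (the parameter names and default values here are illustrative, not fastai's actual implementation):

```python
def lr_find_schedule(start_lr=1e-5, end_lr=10.0, num_steps=100):
    """Exponentially increasing learning rates, as used by an LR-range test.

    Each step multiplies the LR by a constant factor, so the sweep covers
    several orders of magnitude; the loss recorded at each LR is what gets
    plotted by lr_find.
    """
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

lrs = lr_find_schedule()
print(lrs[0], lrs[-1])  # spans start_lr .. end_lr
```

With only ~1,300 examples, `num_steps` (one per batch) is small, which is one plausible reason the plot looks unusual.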


(nirant) #275

ULMFiT for Hindi

State of the Art Perplexity for Language Modeling

New Dataset for Hindi Text Classification Challenges:

BBC News Dataset

Call for Help

I am looking for contributors and help to take this further, specifically: experiments to compare ULMFiT against other classical and deep-learning-based text classification approaches.

Please open a GitHub issue!


(Francisco Rodes) #276

Hi @mollerhoj!

I saw in the first post that you were working on the Swedish model, but in the post I found from you it says Danish and Norwegian. Am I right?

Tell me, so we can collaborate on the Swedish one, or I can start working on it! :slight_smile:

Francisco


(Thomas) #277

I did try to find some state of the art, but it seemed really hard: either the dataset's language was quite different from Wikipedia's (my impression is that Twitter datasets contain a lot of colloquial terms, at least for the sb10k corpus referenced above), or the benchmark wasn't clear to me. I don't know how ULMFiT does on GermEval-2017 linked above; it might be good to test that. @rother or @MatthiasBachfischer might know something more.

Best regards

Thomas


(Monique Monteiro) #278

Hi,

I’ve updated fast.ai library with “git pull” and the following error began to occur:


NameError                             Traceback (most recent call last)
in ()
----> 1 m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
      2         layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
      3         dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

NameError: name 'get_rnn_classifier' is not defined

Any ideas?

Thanks,
Monique


#279

There is a typo in the source code in some versions, and I have encountered the same issue. The quick fix is either to call get_rnn_classifer, or to define an alias with get_rnn_classifier = get_rnn_classifer.
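The alias trick works because Python functions are first-class objects, so the correctly spelled name can simply be bound to the misspelled one. A minimal illustration with a stand-in function (the real fastai function takes many more arguments):

```python
# Stand-in for the misspelled function shipped in some fastai versions;
# in practice this would come from fastai's source, not be defined by you.
def get_rnn_classifer(*args, **kwargs):
    return "classifier"

# Bind the expected, correctly spelled name to the existing function object.
get_rnn_classifier = get_rnn_classifer

m = get_rnn_classifier()  # now callable under either name
```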


(Monique Monteiro) #280

Thanks, @yelh!


(Shankar) #281

Hi @pandeyanil

I can help you with Hindi and Sanskrit.

Could you please guide me on how to start?


(Piotr Czapla) #282

@shankarj67 If you haven’t yet, check out http://course.fast.ai/lessons/lesson10.html, where Jeremy shows how to train and use the language models.
Once you are ready to start, there are also scripts that Jeremy and Sebastian created for the ablation studies. They are quite useful, since you can train your model with just command-line parameter changes, and they have pretty good documentation here:
https://github.com/fastai/fastai/blob/master/courses/dl2/imdb_scripts/README.md


(Piotr Czapla) #283

@t-v, @MatthiasBachfischer, @elyase, @rother, @aayushy

GermEval 2018 has some tasks pretty well suited to ULMFiT: classification and fine-grained classification. In case you aren’t taking part in the competition already, we can train ULMFiT with SentencePiece on the competition data, and we will be able to compare the results on September 21 (the workshop day).

If you took part in the competition and won, can you share your paper or provide an appropriate citation?
We won the 3rd task in PolEval 2018 using ULMFiT with SentencePiece as the tokenization; unfortunately, the task was just about creating a language model, so we couldn’t use the transfer learning. I’m looking for an example where SentencePiece + ULMFiT achieves SOTA on downstream tasks, to justify our claims in the paper.


(Piotr Czapla) #284

If you take part in any competition with your LMs, one thing that helped us the most was trying many different parameters on a very small corpus (10M tokens); thanks to this, we could check 53 combinations in just under a day.
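A sweep like that can be organized with a simple grid over candidate settings. The hyperparameters and values below are purely illustrative (the post doesn't say which 53 combinations were tried); this just sketches the bookkeeping:

```python
from itertools import product

# Hypothetical search space; in practice you'd pick the knobs that matter
# for your LM (learning rate, dropout scaling, BPTT length, ...).
grid = {
    "lr": [1e-3, 5e-3, 1e-2],
    "dropout_mult": [0.5, 0.7, 1.0],
    "bptt": [70, 140],
}

# One dict per configuration, to be trained on the small 10M-token corpus.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 3 * 3 * 2 = 18 configurations
```

Because each run uses only the small corpus, the whole grid finishes quickly, and the best settings can then be applied to the full corpus.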


(Binal Patel) #286

I’ve been training an LM on clinical/medical text using the MIMIC-III database, and things have been going really well. The initial model I completed today (~13 hours of training time) had a perplexity of ~15 on the validation set, with an average accuracy of 60% in predicting the next word.
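For anyone wanting to relate numbers like these to their own runs: perplexity is just the exponential of the mean cross-entropy loss in nats, so the two are directly convertible. A quick sanity check (the loss value below is back-derived from the reported perplexity, not taken from the post):

```python
import math

def perplexity(avg_ce_loss_nats):
    """Perplexity of a language model given its mean cross-entropy in nats."""
    return math.exp(avg_ce_loss_nats)

# A validation perplexity of ~15 corresponds to a cross-entropy of ln(15) ≈ 2.71.
loss = math.log(15)
print(round(perplexity(loss)))  # 15
```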

The initial model is a word-level model that uses the tokenization methods from the course; this will be my baseline for comparing different tokenization methods and hyperparameters.

The initial results seem too good to be true to me, so I’ll be digging into them a bit more to see whether there’s some area where I’m allowing information leakage, or whether the model has just gotten really good at predicting nonsense (for example, there’s a lot of upper case in my corpus, so I wonder if it’s gotten really good at predicting the uppercase token). I’ll also need to do some more research to see if there are published papers I can compare results against.

All in all, it’s pretty amazing how quickly I’ve been able to set this up and get things running; thanks to everyone in this thread for sharing their work and thoughts. I’m writing up a blog post about what I’m currently doing and will share it soon as well.


(Kristian Rother) #287

We have an entry for GermEval (binary task only), but I’m fairly confident that it is not that great. Unfortunately, I saw the competition late and had a very heavy workload towards the end, which clashed a bit with doing more. Additionally, there were some technical difficulties towards the end (a heatwave in Germany + computers that crunch for 3-4 days = bad combination). We deliberately kept it very vanilla ULMFiT, so I just used a 50k-token German Wikipedia LM, about 300k self-collected unlabeled tweets, and just the provided training data. No ensembling. The LM and the Twitter model are pretty decent, I think (<28 perplexity and <18 perplexity respectively). The classifier eventually converged (I underestimated this step), and we got an F1 of about 0.8 on the validation set, which I’d have been very happy with, but a rather disappointing score on the test set. I’ll discuss the final results after the event (it’s this weekend). If anyone else from these forums attends, shoot me a PM and let’s meet/talk :slight_smile:

Even with the very hectic finish, I’d do it again. Very many lessons learned. I’m confident that the results can be improved a good bit, and I have some ideas but little time :slight_smile:


ULMFiT - German
(Piotr Czapla) #288

Let’s clean up and get ULMFiT working on our languages

Jeremy gave us an excellent opportunity to deliver very tangible results and learn along the way. But it is up to us to get ourselves together and produce working models.

I know that ULMFiT is a beast (sometimes): you need tons of memory, and it takes a full day of warming your room just to see that the language model isn’t as good as you wanted. I get it, but that is how deep learning usually feels :slight_smile: if it were easy, there wouldn’t be any fun in doing this.

But we are so close. Let’s get it done!

How about multiple self-support groups?

I mean a chat with people who work on the same language model: people who care that your model got a perplexity of 60, who understand whether that is good or bad, and who can offer you an emoji or an animated GIF.

A support group == a thread for each language.

If you are in, vote below to join a language group and start training.
The first person who votes should create a thread and link it in the first post above (you have 3 votes):

  • Bengali
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Danish
  • Esperanto
  • Estonian
  • Finnish
  • French
  • German
  • Hebrew
  • Hindi
  • Italian
  • Indonesian
  • Japanese
  • Korean

0 voters

  • Malay
  • Malayalam
  • Medical
  • Music (generating music in the style of Mozart & Brahms)
  • Norwegian
  • Polish
  • Portuguese
  • Russian
  • Sanskrit
  • Spanish
  • Swahili
  • Swedish
  • Tamil
  • Telugu
  • Thai
  • isiXhosa

0 voters