Language Model Zoo 🦍

Hi everyone!
I have started training an AWD LSTM model using v1 of fastai. While I was completely fascinated by the ease of use (it took, like, 5 lines of code to get started) and the flexibility of the framework, I have been running into technical problems. I mostly use default parameters, only tweaking Adam's betas and the learning rate; my corpus is 110 million tokens split 90/10 into train/validation. The first epoch goes mostly fine, though GPU memory utilization is around 99% from the start, but when I start another epoch I get a CUDA OOM error, which prevents me from using cyclical learning rates. Sometimes I get the OOM at the end of the first epoch. Cutting down on bptt leads to slower convergence (and probably a worse outcome).
Did anyone have this problem and find a solution? My setup is a deep learning image on GCP with a K80 (12 GB).
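For concreteness, this kind of fastai v1 setup looks roughly like the following (file and column names are placeholders, the bs/bptt values shown are just the defaults, and the exact API differs slightly between 1.0.x releases):

    from fastai.text import *

    path = Path('data')  # placeholder
    # bs and bptt are the main knobs for peak GPU memory
    data_lm = TextLMDataBunch.from_csv(path, 'corpus.csv', text_cols='text',
                                       bs=64, bptt=70)
    learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    # moms schedules the cyclical momentum (Adam's beta1) over the one-cycle run
    learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))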

I was just looking into doing this for Turkish. Glad to have found this thread.

You can lower the batch size. The training will take longer, but with the right size you should not have memory errors.

An OOM at the end of the first epoch may be caused by a batch size that is too large. Validation does not use the sampled softmax, so it has a larger memory requirement; in the original ULMFiT scripts the batch size was cut 5 times for validation:

    trn_dl = LanguageModelLoader(trn_lm, bs, bptt, batch_sets=batch_sets)
    val_dl = LanguageModelLoader(val_lm, bs//5 if sampled else bs, bptt, batch_sets=1)

That is why the validation loader uses bs//5 when the sampled softmax is enabled for training.

@anz9990
Awesome work! Do you know what the SOTA is for Japanese IMDB?

If you are above or near SOTA, can you share your weights and point me to the publication that shows what the current SOTA on IMDB is?

It would be awesome if you could update the wiki thread above with your findings by starting a new thread like ULMFiT - Japanese and putting your SOTA and your results in a table. Here is an example

Thanks! All I really did was use the lesson notebook and make a few changes for my dataset :wink:
There isn't a Japanese IMDB dataset; the one I used was a Yahoo Movie Reviews dataset that someone kindly put in their repository.

This is the latest publication I've found on Japanese sentiment classification. I guess since I'm around 90-91% I'm pretty close to SOTA, but it's really not comparable because they are using a different dataset.

The datasets they use for benchmarks are available for download only after an application review by the research organization that curates them, and they only accept applications from researchers at a university or research institution. Since I'm not one, I can't get access.

I can share the weights for this model and the ja-wiki language model that I trained for transfer learning if that's useful.

@cstorm125 can you create a thread for Thai results and post it in the wiki thread above, similar to the other languages that are either done or works in progress? That way people can quickly check where we are and join in on the languages where the work is still ongoing.

Done

I had a look, and the request form is quite long! We have the same situation in Polish: there is one organisation that has the largest dataset of Polish sentences, but they don't give access to people because of legal issues. Fortunately, they offered to run our models on their data and maybe publish the weights, which is good enough for me.

Maybe you can try to drop them an email? I would do it for you, but I'm afraid that, given that the whole website is in Japanese, they won't appreciate English :).

But either way, create a ULMFiT - Japanese thread and post your work there so we can improve upon it.

Maybe try reducing your batch size? I think the default is 64; maybe try 48 or 32.

Had already asked this in the Malay-focused thread, but re-posting it here for a wider audience.

If there is no local research being done on top of a fixed dataset/corpus (i.e. IMDB), how does one actually establish that our results are "state of the art"?

I have a similar issue with sentiment analysis for Polish, and I'm going to compare the model against itself (with and without pre-training on Wikipedia) and against the cloud services that are available for Polish. I think this should be good enough, and I'm working on getting a proper sentiment analysis dataset for Polish with the people who organised PolEval.

@lesscomfortable can you create a Spanish thread and post a summary of your implementation of ULMFiT?

@sgugger I know you must be super busy, but could you create a French thread and describe what you have achieved there so far? (btw, make the first message a wiki)

@nandobr @monilouise @saulberardo, have you guys managed to do something with Portuguese? If so, can you create a thread and describe which dataset you were testing your models on and where you are with the tests?

@shoof, @Moody where are you with Chinese? Would you mind starting a thread so that we can join the development?

@mollerhoj have you tried running ULMFiT on any classification task?

Absolutely! Delighted to have you join the development on Chinese! I've read your paper and learned quite a lot of things regarding sentencepiece :slight_smile:

Hi @sgugger, I want to train a fastai v1 LM in Portuguese. Can you please tell me where I can set language = "pt"? Should the spaCy version be 2.0.16 or 2.0.17? I saw you spent around 1 hour/epoch. Which VM were you using?

Cool! Do you want to create a Portuguese thread?
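
In fastai v1 the language is set on the Tokenizer and passed into the DataBunch factory; something along these lines should work (the path, dataframes, CSV names, and text column below are placeholders):

    from fastai.text import *
    import pandas as pd

    # placeholder dataframes, each with a 'text' column
    train_df = pd.read_csv('train_pt.csv')
    valid_df = pd.read_csv('valid_pt.csv')

    tok = Tokenizer(lang='pt')  # the lang code is forwarded to the spaCy tokenizer
    data_lm = TextLMDataBunch.from_df(Path('data'), train_df, valid_df,
                                      tokenizer=tok, text_cols='text')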