@cedric, nice to see progress on the Malay language. Can you tell us how many words you have in your vocabulary?
We should check this model on downstream tasks, like text classification. The issue with language modelling is that you can get superb perplexity with a small / too small vocabulary, and that perplexity doesn't necessarily translate to good performance on downstream tasks.
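To see why perplexity numbers aren't comparable across vocabularies, here is a toy illustration (not anyone's actual model): a completely uninformative uniform model has perplexity equal to its vocabulary size, so shrinking the vocabulary alone makes the number look dramatically better.

```python
import math

# Toy illustration: the perplexity of a uniform model over V tokens is
# exp(log V) = V, so the same (uninformative) model reports a far "better"
# perplexity with a small vocabulary than with a large one.
def uniform_perplexity(vocab_size: int) -> float:
    cross_entropy = math.log(vocab_size)  # mean negative log-likelihood per token
    return math.exp(cross_entropy)

print(uniform_perplexity(5_000))    # 5000.0  -- looks great, tiny vocab
print(uniform_perplexity(100_000))  # 100000.0 -- same model quality, big vocab
```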
If you can't find any competition for Malay, try what we are doing for Polish:
- Find any text corpus that you can classify, e.g.:
  - Newspaper articles: business, politics, sport, fashion, etc.
  - Sentiment on user comments (we are working with the Polish version of Goodreads to obtain comments)
  - Worst case, just classify whether a text comes from Wikipedia or from a newspaper (the first sketch below shows how such a dataset could be built)
- Since such a dataset would be new you won't know the SOTA, but running text classification both without and with pretraining will give you a good baseline and show how much pretraining helps (the second sketch below).
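For the worst-case option above, a minimal sketch of how such a dataset could be built; the file names ms_wikipedia.txt and ms_news.txt are hypothetical placeholders for your own text dumps:

```python
import csv
import random

# Label each line of text by its source, then write a CSV that any
# text classifier can consume. File names here are placeholders.
def load_lines(path, label):
    with open(path, encoding="utf-8") as f:
        return [(line.strip(), label) for line in f if line.strip()]

rows = load_lines("ms_wikipedia.txt", "wiki") + load_lines("ms_news.txt", "news")
random.shuffle(rows)  # mix the two sources before any train/valid split

with open("wiki_vs_news.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    writer.writerows(rows)
```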
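And a rough sketch of the with/without-pretraining comparison, assuming the fastai v1 text API; the CSV from the previous sketch, the encoder name, and the training schedule are all placeholders, not a definitive recipe:

```python
from fastai.text import *

path = Path('data')  # directory containing wiki_vs_news.csv

data_lm = TextLMDataBunch.from_csv(path, 'wiki_vs_news.csv',
                                   text_cols='text', label_cols='label')
data_clas = TextClasDataBunch.from_csv(path, 'wiki_vs_news.csv',
                                       text_cols='text', label_cols='label',
                                       vocab=data_lm.vocab)

# Baseline: classifier trained from randomly initialised weights
baseline = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5,
                                   pretrained=False)
baseline.fit_one_cycle(4, 1e-2)

# With pretraining: train the LM first (pretrained=False since there are
# no Malay weights to download; in practice you would train this LM on a
# large Malay corpus such as Wikipedia), save its encoder, and reuse it.
lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
lm.fit_one_cycle(4, 1e-2)
lm.save_encoder('malay_enc')

clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5,
                              pretrained=False)
clf.load_encoder('malay_enc')
clf.fit_one_cycle(4, 1e-2)  # compare accuracy against the baseline
```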
Besides, Google recently released Dataset Search; you may try to find Malay datasets there:
https://toolbox.google.com/datasetsearch
@cedric, @hafidz, what would you say to creating a separate thread for Malay and discussing it there? That way you will have the history of your work in one place, easy to check. Have a look at how it works for German: ULMFiT - German