ULMFiT - Indonesian

cahya · October 23, 2018, 11:28am

Results:
Language models:
Indonesian language model with a perplexity of 27.67
Data set & benchmarks:
Text classification was performed on the Word Bahasa Indonesia Corpus and Parallel English Translation dataset from PAN Localisation, and ULMFiT was compared with other algorithms: Naive Bayes (NB), Linear Classifier (LC), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (Xgb), Convolutional Neural Network (CNN), LSTM and GRU:

Name	Accuracy
NB, Count Vectors	0.9269
LC, Count Vectors	0.9265
RF, WordLevel TF-IDF	0.8338
Xgb, WordLevel TF-IDF	0.8070
CNN	0.9263
Yoon Kim’s CNN	0.9163
RNN-LSTM	0.9305
RNN-GRU	0.9296
Bidirectional RNN	0.9267
RCNN	0.9221
ULMFiT	0.9563
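
For readers comparing language models: perplexity is conventionally the exponential of the mean per-token cross-entropy loss. The sketch below assumes the loss is measured in nats (natural log) and that the 27.67 above follows this convention; the function names are illustrative only and are not taken from the fastai code.

```python
import math

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(mean_cross_entropy)

def loss_from_perplexity(perplexity):
    """Inverse mapping: the validation loss implied by a given perplexity."""
    return math.log(perplexity)

# Under these assumptions, a perplexity of 27.67 implies a loss of about 3.32.
implied_loss = loss_from_perplexity(27.67)
print(f"implied validation loss: {implied_loss:.2f}")
```

This makes it easier to compare a reported perplexity against the validation loss printed during training.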

ahmadarib · October 24, 2018, 10:17am

cool, thanks @cahya bookmarked for later

piotr.czapla · October 24, 2018, 8:00pm

Awesome work @cahya! Can you share the paper with the results you cited above? If we get more models, we can write a collective paper or blog post that shows how ULMFiT revolutionised text classification in so many languages!

piotr.czapla · October 24, 2018, 8:14pm

@jeremy have a look: @cahya managed to reduce the error by 37%, i.e. (0.9563 - 0.9305) / (1 - 0.9305), on Indonesian document classification, relative to the best baseline (RNN-LSTM).
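
The 37% figure is a relative error reduction: the drop in error rate divided by the baseline error rate. A minimal sketch of that arithmetic, using the RNN-LSTM and ULMFiT accuracies from the table (the function name is just illustrative):

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors removed by the new model."""
    baseline_err = 1.0 - baseline_acc   # 0.0695 for RNN-LSTM
    new_err = 1.0 - new_acc             # 0.0437 for ULMFiT
    return (baseline_err - new_err) / baseline_err

reduction = relative_error_reduction(0.9305, 0.9563)
print(f"{reduction:.1%}")  # prints 37.1%
```

The absolute gain is only 0.0258 in accuracy, but because the baseline error is already small (0.0695), that gain removes over a third of the remaining errors.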

piotr.czapla · October 24, 2018, 8:33pm

@cahya do you have a Twitter handle?

cahya · October 24, 2018, 9:23pm

Hi @piotr.czapla, actually I ran all of the text classifications myself (from Naive Bayes to ULMFiT), since it is still a challenge to find a curated or publicly available Indonesian text dataset. Luckily, I eventually found a small curated dataset for text classification (with 4 classes and around 24K sentences). Obviously, there are no earlier text classification results on this dataset to compare against.

Here is the link to the text classification using ULMFiT and using all other techniques.

cahya · October 24, 2018, 9:34pm

Actually I did not realise that it reduces the error by 37% as you said. I just thought that it only improves the accuracy by 0.0258 points (0.9563 - 0.9305), which did not sound so big to me.
Anyway, as requested, here is my Twitter account: @CahyaWr

jeremy · October 25, 2018, 4:39am

This is great! And @piotr.czapla thanks a lot for the mention and twitter handle request.

piotr.czapla · February 12, 2019, 3:43pm

@Cahya is “Word Bahasa Indonesia Corpus” a news classification task?

cahya · February 12, 2019, 5:27pm

@piotr.czapla yes, this is a news classification task. They call the corpus “500,000 Word Bahasa Indonesia Corpus and Parallel English Translation” (http://www.panl10n.net/english/OutputsIndonesia2.htm). The dataset was collected from various Indonesian online news sites and translated to English for a Statistical Machine Translation framework in 2008, at a time when machine learning was not yet used for machine translation. Sadly, this is the only public and curated dataset for Indonesian text classification I could find on the internet.

vikassb · February 23, 2020, 9:02am

Hi @cahya
I got a freelancing gig related to the Indonesian language and went through your repo. As the fastai code has changed since then, I was unable to use the weights.
But using the data you prepared, I tried to update the notebook. Although the accuracy (text classification) is quite comparable, the perplexity is quite high.
If you can guide me to make it better, that would be great.
Repo: https://github.com/VikasSinghBhadouria/Indonesian_Language_Modelling

cahya · February 24, 2020, 4:39pm

Hi @vikassb
A few months ago I used https://github.com/n-waves/multifit to create a new Indonesian LM; I will upload it to the repository. The results (perplexity and accuracy) were similar, I think.