ULMFiT - Indonesian

(Cahya Wirawan) #1

Results:
Language models:
Indonesian language model with a perplexity of 27.67
Data set & benchmarks:
Text classification was performed on the Word Bahasa Indonesia Corpus and Parallel English Translation dataset from PAN Localisation, and ULMFiT was compared with other algorithms such as Naive Bayes (NB), Linear Classifier (LC), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (Xgb), Convolutional Neural Network (CNN), LSTM and GRU:

| Name | Accuracy |
|---|---|
| NB, Count Vectors | 0.9269 |
| LC, Count Vectors | 0.9265 |
| RF, WordLevel TF-IDF | 0.8338 |
| Xgb, WordLevel TF-IDF | 0.8070 |
| CNN | 0.9263 |
| Yoon Kim’s CNN | 0.9163 |
| RNN-LSTM | 0.9305 |
| RNN-GRU | 0.9296 |
| Bidirectional RNN | 0.9267 |
| RCNN | 0.9221 |
| ULMFiT | 0.9563 |

6 Likes

Language Model Zoo :gorilla:
(Ahmad Arib) #2

cool, thanks @cahya bookmarked for later :smiley:

0 Likes

(Piotr Czapla) #3

Awesome work @cahya! Can you share the paper with the results you cited above? If we get more models, we can write a collective paper or blog post that shows how ULMFiT revolutionised text classification in so many languages!

0 Likes

(Piotr Czapla) #4

@jeremy have a look: @cahya managed to reduce the error by 37%, i.e. (0.9563 - 0.9305) / (1 - 0.9305), on Indonesian document classification.
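
For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation. The function name `relative_error_reduction` is just an illustrative helper, not something from fastai; the two accuracies are the RNN-LSTM and ULMFiT rows from the table in the first post.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors that the new model removes."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


baseline = 0.9305  # best non-ULMFiT accuracy in the table (RNN-LSTM)
ulmfit = 0.9563    # ULMFiT accuracy in the table

absolute_gain = ulmfit - baseline
reduction = relative_error_reduction(baseline, ulmfit)

print(f"absolute gain: {absolute_gain:.4f}")          # absolute gain: 0.0258
print(f"relative error reduction: {reduction:.1%}")   # relative error reduction: 37.1%
```

The absolute gain looks small because the baseline is already above 93%; measured against the errors that were left to fix, it is roughly a third of them.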

0 Likes

(Piotr Czapla) #5

@cahya do you have a Twitter handle?

0 Likes

(Cahya Wirawan) #6

Hi @piotr.czapla, actually I ran all of the text classifications myself (from Naive Bayes to ULMFiT), since it is still a challenge to find a curated or publicly available Indonesian text dataset. Luckily, I eventually found a small curated dataset for text classification (with 4 classes and around 24K sentences). So there are no earlier text classification results on this dataset to compare against.

Here is the link to the text classification using ULMFiT and using all other techniques.

2 Likes

(Cahya Wirawan) #7

Actually I did not really realise that it reduces the error by 37% as you said. I just thought that it only improves the accuracy by 0.0258 points (0.9563 - 0.9305), which does not sound so big to me :slight_smile:
Anyway, as requested, here is my Twitter account: @CahyaWr

0 Likes

(Jeremy Howard (Admin)) #8

This is great! And @piotr.czapla thanks a lot for the mention and twitter handle request.

1 Like

(Piotr Czapla) #9

@Cahya is “Word Bahasa Indonesia Corpus” a news classification task?

0 Likes

(Cahya Wirawan) #10

@piotr.czapla yes, this is a news classification task. The corpus is called “500,000 Word Bahasa Indonesia Corpus and Parallel English Translation” (http://www.panl10n.net/english/OutputsIndonesia2.htm). The dataset was collected from various Indonesian online news sites and translated to English for a Statistical Machine Translation framework in 2008, a time when there was no machine learning for machine translation :slight_smile: . Sadly, this is the only public and curated dataset for Indonesian text classification I could find on the internet.

1 Like