Indonesian language model with a perplexity of 27.67
Data set & benchmarks:
Text classification on the Word Bahasa Indonesia Corpus and Parallel English Translation dataset from PAN Localisation was performed and compared against other algorithms such as Naive Bayes (NB), Linear Classifier (LC), Support Vector Machine (SVM), Random Forest (RF), Extreme Gradient Boosting (Xgb), Convolutional Neural Network (CNN), and LSTM or GRU:
|NB, Count Vectors
|LC, Count Vectors
|RF, WordLevel TF-IDF
|Xgb, WordLevel TF-IDF
|Yoon Kim’s CNN
cool, thanks @cahya bookmarked for later
Awesome work @cahya! Can you share the paper with the results you cited above? If we get more models, we can write a collective paper or blog post that shows how ULMFiT revolutionised text classification in so many languages!
@jeremy have a look: @cahya managed to reduce the error by 37%, i.e. (0.9563 - 0.9305) / (1 - 0.9305), on Indonesian document classification.
@cahya do you have a Twitter handle?
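For anyone checking the 37% figure, here is a minimal sketch of the arithmetic using the two accuracies quoted above. The function name `relative_error_reduction` is just an illustrative helper, not something from the repo.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's error that the new model removes."""
    baseline_error = 1.0 - baseline_acc   # error rate of the weaker model
    new_error = 1.0 - new_acc             # error rate of the stronger model
    return (baseline_error - new_error) / baseline_error


# Accuracies quoted in the thread: 0.9305 and 0.9563
reduction = relative_error_reduction(0.9305, 0.9563)
print(f"absolute gain: {0.9563 - 0.9305:.4f}")
print(f"relative error reduction: {reduction:.1%}")
```

The absolute gain looks small because the weaker score is already high; measured against the remaining error (1 - 0.9305), the same gain is a much larger share.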
Hi @piotr.czapla, actually I ran all of the text classifications myself (from Naive Bayes to ULMFiT), since it is still a challenge to find a curated or publicly available Indonesian text dataset. Luckily, I eventually found a small curated dataset for text classification (with 4 classes and around 24K sentences). Consequently, there are no earlier text classification results on this dataset to compare against.
Here is the link to the text classification using ULMFiT and using all other techniques.
Actually I did not really know/realise that it reduces the error by 37%, as you said. I just thought that it only improves the accuracy by 0.0258 points (0.9563 - 0.9305), which does not sound so big to me.
Anyway, as requested here is my twitter account: @CahyaWr
This is great! And @piotr.czapla thanks a lot for the mention and twitter handle request.
@Cahya is “Word Bahasa Indonesia Corpus” a news classification task?
@piotr.czapla yes, this is a news classification task. They call the corpus “500,000 Word Bahasa Indonesia Corpus and Parallel English Translation” (http://www.panl10n.net/english/OutputsIndonesia2.htm). The dataset was collected from various Indonesian online news sites and translated to English for the purpose of a Statistical Machine Translation Framework in 2008, at a time when machine learning was not yet used for machine translation. Sadly, this is the only public and curated dataset for Indonesian text classification I could find on the internet.
I got a freelancing gig related to the Indonesian language and went through your repo, but since the FastAI code has changed since then, I was unable to use the weights.
Using the data you prepared, I tried to update the notebook. Although the accuracy (text classification) is quite comparable, the perplexity is quite high.
If you can guide me on how to make it better, that would be great.
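When comparing perplexity between an old and a new run, it may help to check how the number is derived. The sketch below assumes perplexity is the exponential of the mean per-token cross-entropy in nats (natural log); the function names are illustrative and not part of fastai. Note that perplexities are only directly comparable when both models use the same tokenization and vocabulary.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity from a mean per-token cross-entropy given in nats."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    """Inverse mapping: the mean cross-entropy (nats) behind a perplexity."""
    return math.log(perplexity)


def perplexity_from_token_probs(probs):
    """Perplexity from the probabilities a model assigned to each true token."""
    mean_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_nll)


# The perplexity quoted at the top of the thread corresponds to this loss:
print(f"loss for perplexity 27.67: {loss_from_perplexity(27.67):.3f} nats")
```

A validation loss that is only slightly higher translates into a noticeably higher perplexity, because the relationship is exponential.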
A few months ago I used https://github.com/n-waves/multifit to create a new Indonesian LM; I will upload it to the repository. The results (perplexity and accuracy) were similar, I think.