Hey, I am pleased to introduce our state-of-the-art language model and text classifier for the Malay language, with a perplexity of 29.30245 on Malay Wikipedia and 77.5% accuracy on DevCon's Malaya dataset.
Summary: the benchmark shows that ULMFiT currently outperforms text classifiers built with classical machine learning or with other neural network architectures.
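For readers who haven't used ULMFiT: it first trains (or fine-tunes) an AWD-LSTM language model on a corpus in the target language, then reuses that encoder in a classifier that is fine-tuned with gradual unfreezing and discriminative learning rates. Below is a minimal sketch using the fastai v1 text API; the file names, hyperparameters, and the `pretrained=False` setting are illustrative assumptions, not the exact configuration in my notebooks:

```python
from fastai.text import *  # fastai v1 text API

# Stage 1: train an AWD-LSTM language model on a Malay corpus.
# fastai ships English pretrained weights only, so for Malay we
# train from scratch (pretrained=False) on e.g. Wikipedia text.
data_lm = TextLMDataBunch.from_csv(Path('data'), 'wiki_my.csv')  # hypothetical file
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, pretrained=False)
learn_lm.fit_one_cycle(10, 1e-2)
learn_lm.save_encoder('malay_enc')  # keep the encoder for the classifier

# Stage 2: fine-tune a classifier on labeled data, reusing the
# language model's vocabulary and encoder weights.
data_clas = TextClasDataBunch.from_csv(Path('data'), 'labeled.csv',
                                       vocab=data_lm.train_ds.vocab)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('malay_enc')

# Gradual unfreezing with discriminative learning rates, as in the
# ULMFiT paper: train the head first, then progressively unfreeze.
learn_clf.fit_one_cycle(1, 1e-2)
learn_clf.freeze_to(-2)
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))
learn_clf.unfreeze()
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))
```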
Benchmark
Performance and results of the various models for language modeling and sentiment analysis:
| Type | Model | Dataset | Metric | Value |
|---|---|---|---|---|
| Language Model | ULMFiT | Malay Wikipedia | Perplexity | 29.30245 |
| Classification | Multinomial | Malaya | Accuracy | 0.73 |
| Classification | XGBoost | Malaya | Accuracy | 0.71 |
| Classification | Bahdanau | Malaya | Accuracy | 0.66 |
| Classification | Bidirectional | Malaya | Accuracy | 0.69 |
| Classification | Luong | Malaya | Accuracy | 0.64 |
| Classification | Hierarchical | Malaya | Accuracy | 0.70 |
| Classification | fastText | Malaya | Accuracy | 0.71 |
| Classification | ULMFiT | Malaya | Accuracy | 0.77 |
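On the perplexity metric: perplexity is the exponential of the mean per-token cross-entropy loss, so the 29.30245 above corresponds to a validation loss of roughly 3.38. A minimal sketch in PyTorch (the function name and tensor shapes here are my own, for illustration):

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean per-token cross-entropy).

    logits:  (n_tokens, vocab_size) raw model outputs
    targets: (n_tokens,) true next-token ids
    """
    loss = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return math.exp(loss.item())

# e.g. a validation loss of ~3.378 gives exp(3.378) ≈ 29.3,
# matching the perplexity reported in the table above.
```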
My notebooks
The notebooks are available in my GitHub repo: https://github.com/cedrickchee/data-science-notebooks/tree/master/notebooks/deep_learning/ULMFiT

About this project
This project is part of the Language Model Zoo; our discussions take place in the earlier Language Model Zoo thread.
Note: the results cited above have not been published in a paper.
Source: the accuracy figures for the other models are taken from the Malaya models comparison.
Thank you for checking this out.