ULMFiT for Malay Language Project

cedric · November 8, 2018, 9:49am

Hey, I am pleased to introduce our state-of-the-art language modeling and text classification results for the Malay language: a perplexity of 29.30245 on Malay Wikipedia and 77.5% accuracy on DevCon’s Malaya dataset.

Summary: the benchmark shows that using ULMFiT for text classification currently outperforms models built using classical machine learning or other neural networks.
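For readers less familiar with the metric, here is a minimal sketch of how the perplexity figure quoted above relates to a cross-entropy loss. It assumes the perplexity is the exponential of the mean per-token negative log-likelihood measured in nats, which is the usual convention but is not confirmed in this thread. The function names are hypothetical, chosen only for illustration.

```python
import math


def perplexity_from_loss(mean_nll_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_nll_nats)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse conversion: perplexity back to mean per-token cross-entropy."""
    return math.log(perplexity)


# Perplexity reported in this post for the Malay Wikipedia language model.
reported_ppl = 29.30245

# Assumption: the figure was computed as exp(validation loss).
implied_loss = loss_from_perplexity(reported_ppl)
print(f"Implied validation loss: {implied_loss:.4f} nats per token")
```

Under that assumption, a lower validation loss translates directly into a lower perplexity, so the two numbers can be compared interchangeably across training runs.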

Benchmark

Performance of various models on language modeling (LM) and sentiment analysis:

Type	Model	Dataset	Metric	Value
Language Model	ULMFiT	Malay Wikipedia	Perplexity	29.30245
Classification	Multinomial	Malaya	Accuracy	0.73
Classification	XGBoost	Malaya	Accuracy	0.71
Classification	Bahdanau	Malaya	Accuracy	0.66
Classification	Bidirectional	Malaya	Accuracy	0.69
Classification	Luong	Malaya	Accuracy	0.64
Classification	Hierarchical	Malaya	Accuracy	0.70
Classification	fastText	Malaya	Accuracy	0.71
Classification	ULMFiT	Malaya	Accuracy	0.77

My notebooks

About this project

Please go to my GitHub repo for more details: https://github.com/cedrickchee/data-science-notebooks/tree/master/notebooks/deep_learning/ULMFiT

This project is part of the Language Model Zoo. Our discussions took place in the previous thread.

Note: there’s no paper with the results I cited above.

Source: figures for the Malaya models comparison.

Thank you for checking this out.

piotr.czapla · November 8, 2018, 10:11am

@jeremy, the language model zoo is accelerating! Have a look at this amazing 15% improvement in error rate over the best baseline! And it is so far the nicest looking summary. Congrats @cedric!
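The 15% figure can be roughly reproduced from the benchmark table in the first post. The sketch below assumes "improvement in error rate" means the relative reduction in error (1 minus accuracy) against the best non-ULMFiT model. The accuracies are copied from the table, and the helper name is hypothetical.

```python
# Accuracies on the Malaya dataset, copied from the benchmark table.
accuracies = {
    "Multinomial": 0.73,
    "XGBoost": 0.71,
    "Bahdanau": 0.66,
    "Bidirectional": 0.69,
    "Luong": 0.64,
    "Hierarchical": 0.70,
    "fastText": 0.71,
    "ULMFiT": 0.77,
}


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's error rate removed by the new model."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Pick the strongest baseline, excluding ULMFiT itself.
baselines = {name: acc for name, acc in accuracies.items() if name != "ULMFiT"}
best_name = max(baselines, key=baselines.get)

reduction = relative_error_reduction(baselines[best_name], accuracies["ULMFiT"])
print(f"Best baseline: {best_name}, relative error reduction: {reduction:.1%}")
```

With the rounded table value of 0.77 the reduction comes out just under 15%. Using the 77.5% accuracy quoted in the announcement instead gives a slightly larger figure.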

pnvijay · November 10, 2018, 4:59am

Great Work @cedric! Inspired by you to see if I can try this out on some of the Indian languages.

cedric · November 11, 2018, 2:19am

Thanks. Please feel free to contribute to the Indian languages. A lot of LMs have been shared on that Language Model Zoo thread but mostly without any baseline comparisons.

nirantk has claimed SOTA perplexity for a Hindi LM, although he doesn’t specify any baseline comparisons. To his knowledge, there is no other work beyond bigrams. He has been working on a similar dataset and LM for Gujarati. Maybe you could collaborate with him and see if it can be taken further?
Vishucyrus successfully trained a Sanskrit LM, but the results are suspicious.
binga created a starter kit (notebooks) for a Telugu LM.
…and many more.