ULMFiT for Malay Language Project

(Cedric Chee) #1

Hey, I am pleased to introduce our state-of-the-art language modeling and text classification for the Malay language, with a perplexity of 29.30245 on Malay Wikipedia and 77.5% accuracy on DevCon’s Malaya dataset.
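
For readers unfamiliar with the metric: perplexity is just the exponential of the mean per-token cross-entropy (negative log-likelihood), so lower is better. A minimal sketch (the 4-token example is purely illustrative, not from the actual evaluation):

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that on average assigns each token probability 1/29.30245
# would score the reported Wikipedia perplexity:
nll = -math.log(1 / 29.30245)
print(round(perplexity([nll] * 4), 5))  # → 29.30245
```

Intuitively, a perplexity of ~29 means the model is, on average, about as uncertain as if it were choosing uniformly among 29 next tokens.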

Summary: the benchmark shows that ULMFiT currently outperforms models built with classical machine learning or other neural networks for text classification.


Performance and results of the various models for LM and sentiment analysis:

| Type | Model | Dataset | Metric | Value |
|---|---|---|---|---|
| Language Model | ULMFiT | Malay Wikipedia | Perplexity | 29.30245 |
| Classification | Multinomial | Malaya | Accuracy | 0.73 |
| Classification | XGBoost | Malaya | Accuracy | 0.71 |
| Classification | Bahdanau | Malaya | Accuracy | 0.66 |
| Classification | Bidirectional | Malaya | Accuracy | 0.69 |
| Classification | Luong | Malaya | Accuracy | 0.64 |
| Classification | Hierarchical | Malaya | Accuracy | 0.70 |
| Classification | fastText | Malaya | Accuracy | 0.71 |
| Classification | ULMFiT | Malaya | Accuracy | 0.77 |
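
A key ingredient behind the ULMFiT numbers is discriminative fine-tuning: when the pretrained LM is fine-tuned, each lower layer gets a smaller learning rate than the one above it (the ULMFiT paper divides by 2.6 per layer). A framework-free sketch of that schedule (the function name and layer count are illustrative, not from this project's code):

```python
def discriminative_lrs(top_lr, n_layers, decay=2.6):
    """Per-layer learning rates, bottom layer first.

    The top (task-specific) layer trains at top_lr; each layer below it
    trains at the layer above's rate divided by `decay`, following
    ULMFiT's eta^(l-1) = eta^l / 2.6 rule.
    """
    return [top_lr / decay ** (n_layers - 1 - i) for i in range(n_layers)]

# e.g. a 4-layer stack (3 AWD-LSTM layers + classifier head) at top_lr=0.01:
lrs = discriminative_lrs(0.01, 4)
```

In fastai this corresponds to passing a slice of learning rates to the fit call, combined with gradual unfreezing of one layer group at a time.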

My notebooks

About this project

Please go to my GitHub repo for more details: https://github.com/cedrickchee/data-science-notebooks/tree/master/notebooks/deep_learning/ULMFiT

This project is part of the Language Model Zoo; our earlier discussions took place in that thread.

Note: there’s no published paper with the results I cited above.

Source: figures for the Malaya models comparison.

Thank you for checking this out.

(Piotr Czapla) #2

@jeremy, the language model zoo is accelerating! Have a look at this amazing 15% improvement in error rate over the best baseline! And it is so far the nicest looking summary :slight_smile: Congrats @cedric!

(Vijay Narayanan Parakimeethal) #3

Great Work @cedric! Inspired by you to see if I can try this out on some of the Indian languages.

(Cedric Chee) #4

Thanks. Please feel free to contribute to the Indian languages. A lot of LMs have been shared on that Language Model Zoo thread but mostly without any baseline comparisons.

  • nirantk has claimed SOTA perplexity for a Hindi LM, although he doesn’t specify any baseline comparisons; to his knowledge there is no prior work beyond bigrams. He has been working on a similar dataset and LM for Gujarati. Maybe you could collaborate with him and see if it can be taken further?

  • Vishucyrus successfully trained a Sanskrit LM, but the results look suspicious.

  • binga created a starter kit (notebooks) for a Telugu LM.

  • …and many more.