ULMFiT - Thai


Type            Dataset    Validation Metric  Value     Model
Language Model  Thai Wiki  Perplexity         46.6051   LSTM
Language Model  Thai Wiki  Perplexity         46.80959  QRNN
Language Model  Wongnai    Perplexity         42.2135   LSTM
Language Model  Wongnai    Perplexity         52.57522  QRNN
Classification  Wongnai    Micro F1           0.60925   LSTM
Classification  Wongnai    Micro F1           0.520333  QRNN
Classification  Wongnai    Micro F1           0.57057   pretrained BERT
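
For readers comparing the perplexity rows against training logs: under the usual definition, perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below assumes that definition applies to the numbers above; the helper names are illustrative and not taken from the thai2fit repo.

```python
import math

def perplexity(token_losses):
    """Perplexity as exp of the mean per-token cross-entropy (in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)

def loss_from_perplexity(ppl):
    """Inverse mapping: the mean validation loss implied by a perplexity."""
    return math.log(ppl)

# Mean losses implied by the Thai Wiki rows, assuming the definition above.
lstm_wiki_loss = loss_from_perplexity(46.6051)
qrnn_wiki_loss = loss_from_perplexity(46.80959)
print(round(lstm_wiki_loss, 3), round(qrnn_wiki_loss, 3))
```

On this scale the LSTM and QRNN Thai Wiki models differ by well under 0.01 nats of validation loss.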

Edit notes:

  • Changed to fastai 1.0 API for all models
  • Validation set for the Wiki language model standardized at 20%; the earlier perplexity in the 30s was due to using only a 1% validation set
  • Changed classification metric to accuracy to standardize across future datasets
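
A note on the metric names: the table is labelled micro F1 while the edit note mentions accuracy. For single-label multiclass classification (each review gets exactly one star rating) the two coincide, because every wrong prediction counts once as a false positive and once as a false negative. A minimal sketch with made-up labels, not Wongnai data:

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical star ratings, for illustration only.
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [1, 2, 4, 4, 5, 3, 4, 3]
print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred, labels=range(1, 6)))
```

The equality no longer holds for multi-label tasks or when only a subset of classes is averaged.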

Github Repo: https://github.com/cstorm125/thai2fit


Awesome, lovely table! Thank you! I’ve added it to the main wiki in the Language Model Zoo

@cstorm125 do you have the previous state of the art values? Can you confirm that the classification is sentiment analysis?

This is the first benchmark for Thai text classification. I will try to find more in the literature. As far as I know, the previous state of the art was a randomly initialized LSTM at 0.58, and transfer learning reaches 0.61 (micro-F1).

The task was to classify restaurant reviews into 1 to 5 stars. I considered it sentiment analysis.


@piotr.czapla I’m trying to incorporate my work into ulmfit-multilingual, but I’m not sure what to do about tokenization and text normalization, since Thai needs a different method than the Moses tokenizer used there. Do you have any ideas?
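
One way to frame the tokenization issue is as a per-language tokenizer hook. Thai is written without spaces between words, so whitespace- and punctuation-based tokenizers such as Moses do not split it into words. The sketch below is only an illustration: `TOKENIZERS`, `get_tokenizer`, and the toy vocabulary are hypothetical and not part of ulmfit-multilingual or thai2fit, and the greedy longest-match segmenter is a stand-in for a real Thai tokenizer.

```python
from typing import Callable, Dict, List, Set

Tokenizer = Callable[[str], List[str]]

def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in for a space-delimited-language tokenizer."""
    return text.split()

def make_longest_match_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Greedy longest-match segmentation against a word list."""
    max_len = max(len(word) for word in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; fall back to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize

# Toy vocabulary for illustration; a real word list would be far larger.
THAI_VOCAB = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}

# Hypothetical registry: language code -> tokenizer function.
TOKENIZERS: Dict[str, Tokenizer] = {
    "th": make_longest_match_tokenizer(THAI_VOCAB),
}

def get_tokenizer(lang: str) -> Tokenizer:
    """Return the language-specific tokenizer, else the whitespace default."""
    return TOKENIZERS.get(lang, whitespace_tokenizer)

print(get_tokenizer("th")("ฉันรักภาษาไทย"))
```

With a hook of this shape, Thai-specific text normalization could be applied inside the registered function before segmentation, leaving the rest of the pipeline unchanged.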