ULMFit - Thai

cstorm125 · October 27, 2018, 5:02pm

Results

Type	Dataset	Validation Metric	Value	Model
Language Model	Thai Wiki	Perplexity	46.6051	LSTM
Language Model	Thai Wiki	Perplexity	46.80959	QRNN
Language Model	Wongnai	Perplexity	42.2135	LSTM
Language Model	Wongnai	Perplexity	52.57522	QRNN
Classification	Wongnai	Micro F1	0.60925	LSTM
Classification	Wongnai	Micro F1	0.520333	QRNN
Classification	Wongnai	Micro F1	0.57057	pretrained BERT
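
For readers comparing the perplexity rows: under the usual definition, perplexity is the exponential of the mean per-token cross-entropy loss (in nats). The sketch below assumes that definition applies to the numbers in the table. The function name `perplexity` is illustrative, not part of fastai or the thai2fit repo.

```python
import math

def perplexity(token_losses):
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Going the other way: the validation loss implied by a reported perplexity.
# 46.6051 is the Thai Wiki LSTM value from the table above.
implied_loss = math.log(46.6051)  # about 3.84 nats per token

# Round trip: a constant per-token loss reproduces the same perplexity.
assert abs(perplexity([implied_loss] * 4) - 46.6051) < 1e-6
```

Because perplexity is exponential in the loss, small differences in loss (for example from a different validation split) show up as noticeably different perplexities.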

Edit notes:

Changed to fastai 1.0 API for all models
Validation set for the Wiki language model standardized at 20%; the earlier perplexity in the 30s was due to a 1% validation set
Changed classification metric to accuracy to standardize across future datasets
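
A note on the metric names: when every example has exactly one label (as with 1 to 5 star ratings), micro-averaged F1 and accuracy are the same number, since each wrong prediction counts once as a false positive and once as a false negative. Below is a minimal sketch that checks this on made-up labels. The data and function names are hypothetical and not taken from the Wongnai experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)

# Made-up star ratings, for illustration only.
y_true = [5, 4, 3, 5, 1, 2]
y_pred = [5, 3, 3, 4, 1, 2]

assert abs(micro_f1(y_true, y_pred) - accuracy(y_true, y_pred)) < 1e-12
```

The equality does not hold for multi-label tasks, where an example can carry several labels at once.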

Github Repo: https://github.com/cstorm125/thai2fit

piotr.czapla · October 28, 2018, 4:45pm

Awesome, lovely table! Thank you! I’ve added it to the main wiki in the Language Model Zoo.

piotr.czapla · February 11, 2019, 5:35pm

@cstorm125 do you have the previous state of the art values? Can you confirm that the classification is sentiment analysis?

cstorm125 · February 12, 2019, 1:37pm

This is the first benchmark for Thai text classification. I will try to find more in the literature. As far as I know, the previous state of the art was a randomly initialized LSTM at 0.58, and transfer learning is at 0.61 (micro-F1).

The task was to classify restaurant reviews into 1 to 5 stars. I considered it sentiment analysis.

cstorm125 · February 18, 2019, 4:03am

@piotr.czapla I’m trying to incorporate my work into ulmfit-multilingual, but I’m not sure what to do about tokenization and text normalization, since Thai needs a different method from the Moses tokenizer currently used. Do you have any ideas?
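
To illustrate why a whitespace-and-punctuation tokenizer such as Moses falls short here: Thai is written without spaces between words, so word boundaries have to be recovered from the character stream, typically with a dictionary or a learned model. The following is a toy sketch of greedy longest-match segmentation with a three-word hypothetical vocabulary. It is not the method used in thai2fit or ulmfit-multilingual, and real segmenters (for example those in PyThaiNLP) handle ambiguity and unknown words far better.

```python
def longest_match_tokenize(text, vocab):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word that matches; fall back to a single character."""
    max_len = max(len(word) for word in vocab)
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical toy vocabulary: "I", "eat", "rice".
vocab = {"ฉัน", "กิน", "ข้าว"}
sentence = "ฉันกินข้าว"  # "I eat rice", written with no spaces

print(sentence.split())                         # whitespace split: one token
print(longest_match_tokenize(sentence, vocab))  # three word tokens
```

Whichever segmenter is chosen, its output can be joined with spaces before being handed to a pipeline that expects space-delimited tokens.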