Language Model in Thai

This sounds like a pretty interesting project to me :slight_smile:

I’ve usually tried to set the min count a bit lower than that, but I haven’t seen any strong rules about where the values should be.

I’d leave the extra symbols in if you think they have meaning; a question mark in English can really change the tone of a sentence (you’re talking to me / you’re talking to me?)

I think your project sounds more like word2vec than resnet :slight_smile: It would probably be easiest to save the encoder and share that file around, anyone who’s trying to build a language model is going to be pretty cluey.

Thanks for the comment!

Yes, you’re right. I’m thinking about saving the vectors to a commonly used format like fastText’s .vec, so people can take it and shove it into their models. On the other hand, though, I also wanted to make a ‘fast-and-dirty’ way to do Thai text classification by allowing users to replace the last layer, like transfer learning in images.
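For anyone curious, the .vec format is just a plain-text header line (`vocab_size dim`) followed by one `word v1 v2 …` line per word. A minimal sketch, assuming an index-to-string vocab list and a matching embedding matrix (the names and toy values below are illustrative, not from the actual code):

```python
# Sketch: dump an embedding matrix to fastText's text-based .vec format.
# `itos` (index-to-string vocab) and `vectors` are assumed names.

def save_vec(path, itos, vectors):
    n, dim = len(itos), len(vectors[0])
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{n} {dim}\n")  # header: vocab size and dimensionality
        for word, vec in zip(itos, vectors):
            f.write(word + " " + " ".join(f"{x:.4f}" for x in vec) + "\n")

# toy 2-word, 2-dimensional example
itos = ["สวัสดี", "ไทย"]
vectors = [[0.1, 0.2], [0.3, 0.4]]
save_vec("thai.vec", itos, vectors)
```

Tools that read fastText’s text format (including gensim’s `KeyedVectors.load_word2vec_format`) should then be able to load the file directly.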


Most of the language models that I’ve seen are LSTMs, so retraining for transfer learning isn’t quite as simple as just replacing the last layer - but definitely possible. Maybe you could publish a working notebook that would allow someone to edit, add their own data, and get a good result?
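To illustrate why it’s more than swapping one layer: the classifier head in Howard and Ruder’s setup pools over the LSTM’s whole sequence of hidden states (“concat pooling”: last state, max-pool, and mean-pool concatenated) before the new linear layers. A toy sketch of that idea, with plain Python lists standing in for tensors:

```python
# Sketch of "concat pooling" from ULMFiT-style classifier heads:
# the encoder's per-timestep hidden states are summarized as
# [last state; max-pool; mean-pool] before the new linear head.

def concat_pool(hidden_states):
    """hidden_states: list of per-timestep hidden-state vectors."""
    dim = len(hidden_states[0])
    last = hidden_states[-1]
    max_pool = [max(h[i] for h in hidden_states) for i in range(dim)]
    mean_pool = [sum(h[i] for h in hidden_states) / len(hidden_states)
                 for i in range(dim)]
    return last + max_pool + mean_pool  # concatenated feature vector

# three timesteps of a 2-dim encoder (toy numbers)
feats = concat_pool([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])
```

The point is that the new head consumes features of the whole sequence, so fine-tuning touches the encoder/head interface, not just one output matrix.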




Quite an impressive work…! :grinning:


Just found @jeremy’s paper on fine-tuning language models for classification. Maybe I’ll try to get a junior version of that done.


Thanks bruh

I’ve found a vocab size of about 30,000 is a good compromise between performance and utility. My guess is min_freq of 50 may be too high - see how large your vocab is as a result.
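For reference, a min_freq plus vocab-cap cutoff can be sketched with `collections.Counter`; the tokens and thresholds below are illustrative, not the actual corpus:

```python
# Sketch: build a capped vocabulary with a min_freq cutoff.
from collections import Counter

def build_vocab(tokens, min_freq=2, max_vocab=30000):
    freq = Counter(tokens)
    # keep the most common words, dropping anything below min_freq
    itos = [w for w, c in freq.most_common(max_vocab) if c >= min_freq]
    return itos

toks = ["กิน", "ข้าว", "กิน", "น้ำ"]  # toy Thai tokens
vocab = build_vocab(toks, min_freq=2)
```

Raising or lowering `min_freq` and re-checking `len(vocab)` is a quick way to see where the cutoff lands for a given corpus.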

Symbols can be an important part of language. I leave them in.

Providing a pre-trained model would be great! Perhaps GitHub Large File Storage (LFS)? Or use Dreamhost? I can put it on our Dreamhost account if you’re interested.


Hi Jeremy. Thank you for your reply! I’ve managed to get min_count = 10 with a smaller batch size. I eyeballed the words with lower frequency than that and found them to be some very random “words” miscut by the segmentation algorithm (thaiwordsareusuallywrittenwithoutspaces), so I think 10 is reasonable. This gave me about 50k embeddings.

I think I’ve saved both the models and the embeddings with LFS in this repo. The next step is trying to do the “transfer learning” from this model to text classification, as shown with the sentiment analysis in the lesson 4 notebook.
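For anyone wanting to do the same, the usual Git LFS recipe looks roughly like this (the file names and patterns here are illustrative, not the actual repo contents):

```shell
# Track large model/embedding files with Git LFS before committing them.
git lfs install
git lfs track "*.pth" "*.vec"
git add .gitattributes          # commit the tracking rules themselves
git add model.pth thai.vec
git commit -m "Add pretrained Thai LM weights and embeddings"
git push origin master
```

Note that `git lfs track` must run before the large files are added, otherwise Git stores them as ordinary blobs.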


Updates: the “transfer learning” approach was a success! I think I’ve probably achieved state-of-the-art in Thai text classification at 94.4% on the most common open-source dataset available. I say “probably” because I couldn’t find any previous literature, but Facebook’s fastText can only get 65% (it’s a four-label classification problem). The most surprising thing is that I barely changed anything from the lesson 4 notebook!!

We follow Howard and Ruder’s (2018) approach to fine-tuning language models for text classification. The language model used is the one previously trained, a version of the AWD-LSTM language model. The dataset is NECTEC’s BEST, whose texts are labeled as article, encyclopedia, news, or novel. We preprocessed the texts to remove the segmentation tokens and used an 80/20 split for training and validation, which resulted in 119,241 sentences in the training set and 29,250 sentences in the validation set. We achieved 94.4% accuracy on the four-label classification task with the fine-tuned model, compared to 65.2% for fastText using its own pretrained embeddings.
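The 80/20 split described above can be sketched with a seeded shuffle; `sentences` stands in for the preprocessed BEST sentences, and the seed value is arbitrary:

```python
# Sketch: reproducible 80/20 train/validation split.
import random

def split(sentences, valid_frac=0.2, seed=42):
    idx = list(range(len(sentences)))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    cut = int(len(sentences) * (1 - valid_frac))
    train = [sentences[i] for i in idx[:cut]]
    valid = [sentences[i] for i in idx[cut:]]
    return train, valid

train, valid = split(list(range(100)))
```

Shuffling before cutting matters here because BEST groups texts by genre; a sequential cut would put whole genres into one side of the split.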


My question now, if anyone’s still following this thread: how do I package the model nicely into a library (I’m contributing to an open-source one called pyThaiNLP) so people can use it without having to download all the PyTorch dependencies?


Exciting news! How about making it a conda package with an environment.yml file like we do?
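Something like the course-style environment.yml, sketched below; the exact environment name, channels, and package versions are assumptions:

```yaml
# Illustrative environment.yml; names and versions are assumptions.
name: thai-lm
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.6
  - pytorch
  - pip
  - pip:
    - fastai
```

Users would then recreate the environment with `conda env create -f environment.yml`.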


I will read up on that. Thanks!

@charin, exciting work, thanks for sharing!

I want to create a dataset of language text from materials science papers to see if we can predict the authors’ core training/expertise, irrespective of their formal affiliation. For example, a chemist could think/write like a mathematician, etc. It would make for a better recruiting algorithm.

My assumption is that these characteristics of language can’t be learnt from the IMDb or Wikipedia datasets. Is that right?

Do you have any pointers on how I can start creating such a dataset? The easiest sources are PDF files; can the text from PDF files be copied to create the dataset, or does it have to be HTML files?

Yes, I used this website before. It is very easy to use without the need for installation.

Apache Tika should work well for a large volume of documents in multiple formats, but I haven’t tried it yet.

Thanks @Moody ! Apache Tika seems more like what I need for batch-processing PDF files. If I can access HTML versions of the papers, do you know of a way to convert them into the dataset format?
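If the papers are available as HTML, stripping tags to get plain corpus text can be done with the standard library alone; a sketch (real papers would need more cleanup, e.g. dropping references and equations):

```python
# Sketch: extract visible text from an HTML document using only the stdlib.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # tags whose contents are not prose

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.parts)

text = html_to_text(
    "<html><body><h1>Title</h1><p>Some text.</p>"
    "<script>x=1</script></body></html>"
)
```

For messier real-world HTML, a dedicated library such as BeautifulSoup is usually more robust, but the idea is the same.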

@Moody also found this NLTK-based solution to create a corpus from *.txt files:

I am guessing this is a way to basically format the text from multiple files and put them in one file? Also, I don’t see many people talking about using NLTK; is there a reason?
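The basic txt-to-corpus step itself can be done with the standard library alone, no NLTK needed. A sketch, with directory and file names purely illustrative:

```python
# Sketch: concatenate *.txt files into a single corpus file, one document per line.
from pathlib import Path

def build_corpus(src_dir, out_path):
    docs = []
    for p in sorted(Path(src_dir).glob("*.txt")):
        # flatten internal newlines so each file becomes one line
        docs.append(p.read_text(encoding="utf-8").replace("\n", " ").strip())
    Path(out_path).write_text("\n".join(docs), encoding="utf-8")
    return len(docs)

# tiny self-contained example
Path("papers").mkdir(exist_ok=True)
Path("papers/a.txt").write_text("first doc", encoding="utf-8")
Path("papers/b.txt").write_text("second\ndoc", encoding="utf-8")
n = build_corpus("papers", "corpus.txt")
```

NLTK’s corpus readers add tokenization and sentence splitting on top of this, which is where they earn their keep; for just gathering raw text they are optional.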

I am going to give this solution a try:

I think we need to separate the problem into two parts:

  • Language modeling: for research papers, I think you can train on Wikipedia.
  • Text classification: as I’ve done with Thai and Jeremy has done with sentiment analysis, you can do transfer learning from the language model you’ve trained IF you have a dataset of papers already labeled with the authors’ expertise.

I think the most difficult part would be the labeling of the papers.

Woah! This is pretty neat. I might have some future questions for you. :clap:

Absolutely welcome! Non-Roman language solidarity!